Abstract

This research project undertook a review and synthesis of previous research on the effectiveness 
of feedback for individual writing development. The work plan was divided into two main 
phases. First, we surveyed all available studies that have investigated the effectiveness of writing 
feedback, including both quantitative and qualitative research, for students who have learned 
English as a first language (L1-English), students who have learned English as a second
language (L2-English), and students who have learned second languages other than English. The 
results of this survey are described in a narrative overview of previous research pertaining to the 
role of feedback in the development of writing proficiency. The survey also identified the major 
theoretical constructs used in this research domain, providing the basis for subsequent statistical 
analysis. 

Second, we built on this survey to carry out a meta-analysis of empirical studies in this 
research area. The goal of the meta-analysis was to provide a quantitative investigation of the 
extent and ways in which feedback has been effective, summarizing the findings of previous 
quantitative studies that have employed suitable statistical measures. Several analytical steps 
were required for the meta-analysis: developing a coding rubric; analyzing the research design 
and adequacy of reporting in studies to determine if they were suitable for inclusion; coding each
study for all relevant research design factors; computing effect sizes for each study; and 
analyzing and interpreting the general patterns that hold across this set of studies. 

The meta-analysis compared the gains in writing development with respect to several 
different kinds of feedback. Overall, feedback was found to result in gains in writing 
development. Beyond that, there were several predictable findings (e.g., that written feedback is 
more effective than oral feedback for writing development) and several other more noteworthy 
trends (e.g., that peer feedback is more effective than teacher feedback for L2-English students; 
commenting is more effective than error location; and in general, focus on form and content 
seems to be more effective than an exclusive focus on form). 

Key words: feedback, writing development, meta-analysis, commenting, error analysis

The Test of English as a Foreign Language™ (TOEFL®) was developed in 1963 by the National
Council on the Testing of English as a Foreign Language. The Council was formed through the
cooperative effort of more than 30 public and private organizations concerned with testing the English
proficiency of nonnative speakers of the language applying for admission to institutions in the United
States. In 1965, Educational Testing Service (ETS) and the College Board® assumed
joint responsibility for the program. In 1973, a cooperative arrangement for the operation of the
program was entered into by ETS, the College Board, and the Graduate Record Examinations®
(GRE®) Board. The membership of the College Board is composed of schools, colleges, school
systems, and educational associations; GRE Board members are associated with graduate education.

1. Introduction


Feedback is generally regarded as essential for writing development at all levels, from 
students at the kindergarten through 12th grade (K-12) levels, to college freshmen taking
composition courses, to graduate students working on dissertation projects. Similarly, feedback
has been considered essential for both first language (L1) and second language (L2) writing
development. 

Despite this widespread perception, much less agreement exists on the kinds of feedback 
that actually make a difference, or even on the kinds of gains in proficiency that can be expected 
from feedback. Numerous papers advocate one or another approach, and many other studies 
describe a writing course where a particular approach was used. Many other papers adopt a 
(quasi)experimental approach, measuring gains in writing proficiency that result from feedback. 

Numerous factors must be considered in any study of feedback to determine which ones 
are actually influential. For example, feedback can be provided by the teacher, other students, or 
an automated system on a computer. Feedback can be written or spoken, and it can focus on 
content, organization, grammatical form, or usage (e.g., spelling). If written, feedback on form 
can comment on the existence of errors, identify the location of specific errors, or actually 
correct errors. And then, of course, questions must be addressed about how to measure potential 
improvements in writing performance resulting from feedback, for example, focusing on 
reduction in errors, the extent to which students incorporate revisions, or overall holistic 
assessments of writing quality. 

Recently, Hyland and Hyland (2006) carried out a comprehensive survey of research on 
feedback, identifying several of the most important issues and describing numerous studies that 
investigate those issues (cf. DiPardo & Freedman, 1988). However, despite the large number of
studies (over 200 in their survey), Hyland and Hyland concluded that there is surprisingly little 
consensus and most of the fundamental questions remain unanswered: 

While the research into feedback on L2 students’ writing has increased dramatically in 
the last decade, it is clear that the questions posed at the beginning of this paper have not 
yet been completely answered. [...] Nor are we a lot closer to understanding the long
term effects of feedback on writing development. (p. 96)

In part, this lack of consensus results from the diverse research designs and 
methodologies used in previous studies of feedback. However, an additional limitation has been 
the lack of quantitative techniques to document the state-of-the-art in this research domain. That
is, previous survey articles, such as Hyland and Hyland (2006), have relied on descriptive 
narratives to survey previous research in this domain. However, those surveys provided no 
quantitative analysis of the distribution of research approaches and designs within the domain. 
For example, how many of these studies have been qualitative reports versus quantitative 
empirical studies? How many of these studies have used experimental designs versus other kinds 
of quantitative comparisons? 

In fact, authors of state-of-the-art articles in applied linguistics usually pay little attention 
to the methods that they used themselves in carrying out the survey. That is, it has generally been 
assumed that the research for a survey article consists of finding as many publications on a topic 
as possible, determining the types of research and the main research issues represented by those 
studies, and then describing the studies that fall into each type. Such surveys rarely specify how 
articles were selected for inclusion in the review or provide any other evidence that the reader 
can use to evaluate the representativeness of the survey. Rather, the survey depends crucially on 
the expert knowledge of the authors. While such descriptions are a tremendous resource for 
future researchers beginning work in a particular domain, it is difficult to determine the extent to 
which the survey actually represents the research domain. 

To address these concerns, recent research in applied linguistics has begun to advocate 
systematic research syntheses, applying the techniques of meta-analysis that have been 
developed over the past few decades for social science research. Systematic research syntheses 
differ from traditional literature surveys in three major ways (Norris & Ortega, 2006, pp. 6-7; 
see also Norris & Ortega, 2007, pp. 807-8): 

1. The selection of studies to be included in the survey is a deliberate part of the 
research process, with explicit procedures for defining the population and identifying 
the research studies to be included or excluded from the survey. 

2. Each research study is critically evaluated for the appropriateness of its research 
design and application of statistical procedures (rather than uncritically reporting 
study conclusions). 

3. Each research study is analyzed with respect to the same set of design variables and 
values, applying a coding scheme developed for the entire meta-analysis. 

A subset of the studies included in a systematic research survey will be suitable for a
subsequent stage of analysis: a statistical meta-analysis based on comparison of effect sizes 
across studies. To be included in this stage of the research synthesis, a study must employ an 
experimental research design and be explicit and complete in its reporting standards. By 
comparing the magnitude of effect sizes across multiple studies in a research domain, it is 
possible to compare the importance of different factors based on the cumulative evidence of all 
empirical studies in the domain. 
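
To make the notion of an effect size concrete, the following minimal sketch computes a standardized mean difference (Cohen's d with a pooled standard deviation) and the common small-sample correction (Hedges' g) for one treatment/control comparison. This is one widely used formulation and is offered only as an illustration; it is an assumption, not necessarily the formula applied in this report. The function names and the summary statistics are hypothetical.

```python
import math


def cohens_d(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference between a treatment and a control group."""
    # Pool the two group variances, weighting each by its degrees of freedom.
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)


def hedges_g(d, n_t, n_c):
    """Apply the usual small-sample bias correction to Cohen's d."""
    return d * (1 - 3 / (4 * (n_t + n_c) - 9))


# Hypothetical summary statistics, invented for illustration only.
d = cohens_d(mean_t=4.2, sd_t=1.0, n_t=20, mean_c=3.7, sd_c=1.0, n_c=20)
g = hedges_g(d, n_t=20, n_c=20)
print(round(d, 3), round(g, 3))
```

Because the result is expressed in standard deviation units, values computed this way can be compared across studies that used different rating scales.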

Two recent studies have applied the techniques of statistical meta-analysis to study 
writing development. These studies included some information on the effectiveness of feedback,
although that was not the primary focus of either one. Truscott (2007) focuses on the quite 
restricted question of the extent to which error correction influences writing accuracy for L2- 
English students. This study concluded that overt error correction actually has a small negative 
influence on learners’ abilities to write accurately. However, the meta-analysis was based on 
only six research studies, making it somewhat difficult to be confident about the generalizability 
of the findings. 

The second study, Graham and Perin (2007), was much larger in scope but focused on
writing instruction (for L1-English adolescent students) rather than the effectiveness of feedback.
As a result, that study considered factors such as different instructional approaches (e.g., writing 
as product versus process); explicit instruction in grammar, sentence combining, writing 
strategies, and so on; prewriting activities; and the use of word-processing for writing. The only 
factor in that study that was directly relevant to the present inquiry was peer assistance, which 
was identified with a moderately large increase in writing quality. 

The present report focuses exclusively on the influences of feedback for writing 
development, providing a large-scale and systematic synthesis of research on this topic. Because 
of the need to follow explicit procedures at all stages, the report is organized somewhat 
differently from a traditional literature review. In Section 2, we document the procedures that we 
used to describe the research domain and to attempt to construct an exhaustive catalog of 
research studies within that domain. We then describe our initial coding scheme, identifying the 
major ways in which studies of writing feedback can differ from one another. We also discuss 
the research designs and reporting standards that are required for a study to be suitable for 
inclusion in the statistical meta-analysis. 

In Section 3, we provide an empirical survey of research studies in this domain, including
discussion of the breakdown of studies across the major variables included in our coding scheme. 
Section 3 also describes the subset of studies that are appropriate for statistical meta-analysis of 
their effect sizes. 

In Section 4, we turn to the procedures used for the statistical meta-analysis. Numerous 
analytical decisions are required for this stage of the synthesis, and our goal here is to describe 
those as fully and explicitly as possible. 

Section 5 provides the most important information from this synthesis: the results of the 
statistical meta-analysis. In this section, we compare the magnitude (and dispersion) for the 
effect sizes of several different factors that have been hypothesized to influence the effectiveness 
of feedback for writing development. Based on these analyses, we are able to provide an overall 
perspective on the influence of feedback, identifying factors that seem to make a difference for 
writing development and those that seem to be less influential. 

A summary and discussion of the statistical meta-analysis is taken up in Section 6. 

2. Procedures I: Describing the Research Domain 

The first major stage of this project was to describe the research domain. Research for 
this stage was carried out in three steps: First, we conducted a literature search to identify all 
relevant studies, employing the procedures described in Section 2.1. Second, we developed an 
explicit coding scheme, described in Section 2.2, that included all major variables represented in 
the research designs of these studies and the major values that were distinguished for those 
variables. Finally, we coded each research study for all variables in our coding scheme. 

2.1. The Literature Search

The literature search began with an operational definition of the population of studies, 
followed by a comprehensive sampling of studies in that population. For the purposes of this 
search, we attempted to identify all studies that addressed the central research question of our 
research synthesis: 

Which kinds of feedback are influential for which kinds of gains in writing proficiency? 

This research question has two main components: (a) the different operationalizations of 
feedback (kinds of feedback) and (b) the range of outcome measures (kinds of gains in writing 
proficiency). Thus, we included articles that investigated different sources of feedback (e.g.,
teacher, peer, computer), as well as different forms of feedback (e.g., direct correction, editing 
codes, highlighting) delivered in different modes (i.e., spoken, written, and computer mediated). 
For similar reasons, in addition to articles that report development in terms of writing proficiency
measures, we also included articles that reported results from other outcome measures (e.g., 
surveys of student attitudes, analyses of post-feedback revisions). 

We included studies of both native and nonnative English-speaking students (including
developing and remedial writers). Our goal in doing this was to allow comparison of the two
populations, asking whether feedback is influential in the same ways and to the same extent in
L1 and L2 populations.

Location and selection of research studies. The first step in our survey was to identify 
research journals that could potentially publish articles on feedback. This was done by exploring 
library catalogs and databases and by including any journal cited in previous survey studies. The 
following journals were included in this step: 

Academic Writing Across the Disciplines Applied Linguistics 

Assessing Writing Australian Journal of Language and Literacy 

British Journal of Educational Technology CALICO Journal 

CALL Electronic Journal Canadian Modern Language Review 

College Composition and Communication Computer Assisted Language Learning 

Computers and Composition Computers & Education 

ELT Journal English for Specific Purposes 

English Journal Foreign Language Annals 

International Review of Applied Linguistics Issues in Writing 

Journal of Basic Writing Journal of Educational Psychology 

Journal of Educational Research Journal of English for Academic Purposes 

Journal of Second Language Studies Journal of Second Language Writing 

Jnl of Technical Writing & Communication Language Learning 

Language & Learning Across the Disciplines Language Teaching 

Language Teaching Research Modern Language Journal 

ReCALL Research in the Teaching of English 

RELC Journal Rhetoric Review 

Second Language Research Spaan Fellow Working Papers

Studies in Second Language Acquisition System 

Teaching English in the Two Year College TESL Canada Journal 
TESL-EJ TESOL Journal 

TESOL Quarterly Writing Center Journal 

Written Communication 

For each of these journals, we searched the online table of contents to identify all articles 
that had any of the following keywords: feedback, response, comment(ing), revision, peer, and 
writing. The range of dates searched was dictated by the archival status of individual journals but 
in general spanned the period 1980-2007. In addition, an online search of the ERIC database was 
conducted using the keywords writing and feedback. Our literature search focused primarily on 
studies published in academic journals. Further, as individual articles were being analyzed, the 
list of references in each was reviewed to identify additional articles (including studies published 
in edited books) that had not yet been collected. The studies included in our literature survey are 
mostly published research articles; we made no systematic attempt to include studies from the 
“fugitive” literature (e.g., unpublished papers, dissertations, conference presentations), apart 
from research papers identified through the ERIC database. 
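
The keyword screen described above can be sketched as a simple filter over article titles. The keyword list comes from the text; everything else (the whole-word, case-insensitive matching rule, the function names, and the sample titles) is an assumption made for illustration and does not reproduce the authors' actual search procedure.

```python
import re

# Keywords named in the text; "comment(ing)" is expanded to both forms.
KEYWORDS = ["feedback", "response", "comment", "commenting",
            "revision", "peer", "writing"]

# Assumption: whole-word, case-insensitive matching against the title.
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)


def matches_keywords(title):
    """Return True if the title contains at least one search keyword."""
    return PATTERN.search(title) is not None


def screen_titles(titles):
    """Keep only the titles that pass the keyword screen, in original order."""
    return [title for title in titles if matches_keywords(title)]


# Hypothetical titles, invented for illustration only.
titles = [
    "Peer review in the ESL composition classroom",
    "Vowel reduction in casual speech",
    "Teacher commenting practices on student drafts",
]
selected = screen_titles(titles)
print(selected)
```

A screen of this kind only narrows the candidate pool; each retained article would still need to be read to confirm that it falls within the research domain.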

Using these methods, we were able to collect articles representing a variety of 
epistemological traditions. Unlike the methods used for some other meta-analyses, we did not 
adopt a priori exclusion criteria regarding research methodology (e.g., accepting only 
experimental or quasi-experimental studies). Instead, the articles included in our survey ranged 
from tightly controlled experimental studies to qualitative case studies. This inclusive approach 
allowed us to evaluate the maturity of the research domain before selecting empirical studies for 
the quantitative meta-analysis. 

While articles were not excluded from the survey on the basis of research methodology, 
we did exclude studies that were not in the research domain of focus here. In particular, we 
excluded the following: 

• studies focusing on oral (rather than written) production; 

• studies in which computer-mediated chat was the target of feedback, because
engaging in chat is a different communicative enterprise from the writing tasks
normally considered in studies of writing development (note: we did include studies
that investigated chat as the means through which feedback on writing was
delivered); and

• studies focusing on the writing of special-needs student populations (e.g., deaf 
students). 

2.2. Identifying the Parameters of Variation Among Research Designs and Coding the 
Study Reports 

The central research question motivating this research synthesis has two main 
components: the different kinds of feedback and the different measures of improvement in 
writing proficiency. We thus began this project by carrying out preliminary research on how 
these two constructs have been approached in previous research. 

Then, with that background, we developed an explicit coding rubric. The goals of this 
step were to identify all important factors that varied across feedback studies (e.g., age of the 
subjects, type of writing task required, type of feedback provided) and to itemize the possible 
values for each of those variables. This rubric was developed inductively, by reading through a 
wide sample of research studies to identify various ways in which their research designs could 
vary. The rubric was subsequently applied for an empirical description of this research domain 
(described in Section 3). 

Operationalizations of feedback in the research literature. On initial consideration, 
feedback might seem to be a simple construct—providing a constructive evaluation of writing 
quality to the student. However, in actual practice and in the research literature, an extremely 
wide range of variation was found in the actual realization of feedback. These differences can be 
described with respect to five variables: type, focus, tone, mode, and source. 

Type of feedback. In research on traditional teacher-generated feedback, the distinction 
between direct and indirect feedback has been one focus of studies in the areas of writing and 
second language acquisition (SLA) research (e.g., Ferris, 2003, 2006; Ferris & Roberts, 2001; 
Robb, Ross, & Shortreed, 1986). The term direct feedback is used to denote instances where the 
writing instructor makes an explicit correction to the student’s text (e.g., by writing in the correct 
grammatical form), while indirect feedback denotes instances where the instructor indicates that 
something about the student’s writing is problematic (e.g., by underlining an ungrammatical 
construction and/or marking the problematic section of text with a special code) but does not 
provide an immediate correction. In actual practice, direct feedback is rarely used as a treatment
in empirical research, while numerous types of indirect feedback have been investigated. These 
include identifying the location of problems, providing comments in the margins, global 
comments at the end of a paper, and even oral comments given to the student. 

Focus of feedback. This area of research has dealt with the features of student writing 
(e.g., lexis, grammar, mechanics, organization, content) that the feedback provider chooses to 
focus on. As noted above, much feedback research has focused on error correction. Researchers
in second language writing distinguish between grammatical and word choice errors,
because such “L2 errors” are thought to stigmatize L2 users. For example, Ferris (1999) divided
such errors into two classes, which she labeled treatable and untreatable. Treatable errors are 
those that can be addressed through explicit instruction and include language features such as 
article usage and subject-verb agreement (i.e., rule-governed constructions). Untreatable errors 
are those that are less readily teachable in that they are not governed by a clear or simple set of 
rules. Problems with word choice are one example Ferris gave of such untreatable errors. 

The predominating emphasis on error correction seems to be motivated by the perceived 
severity of different error types among readers of L2 texts. However, not all teacher comments 
address aspects of student language use that can be objectively characterized as incorrect or even 
problematic (e.g., positive feedback, clarification questions). Furthermore, many student writers 
desire guidance in these additional areas, especially as they reach more advanced levels of 
writing proficiency (Leki, 2006). While feedback on surface level errors may be comparatively
easy to provide (both for human teachers and computer programmers), an important question is 
whether this type of feedback leads to greater gains in student writing proficiency than more 
holistically focused feedback on text content, organization, or audience/purpose. 

Tone of feedback. Following from the idea that not all feedback focuses on student 
errors, it is also the case that feedback can vary in the degree to which it praises areas of strength 
or criticizes areas of weakness (see, e.g., Hyland & Hyland, 2001). Concern has been expressed 
in the literature that overly negative feedback will adversely affect the student’s motivation. At 
the same time, it is possible that some students may view positive feedback as less useful than 
critical feedback that identifies features of their writing that need to be revised. Thus, an 
important concern for instructors is determining the best tone for constructive criticism, given (a) 
the nature/amount of feedback that needs to be provided and (b) the nature of the teacher-student
interpersonal relationship. 

Mode and source of feedback. Finally, feedback can be provided through any available 
channel, or mode: oral, written, or computer mediated. Although it has not been a major factor in
previous research, several studies considered the influence of one mode of delivery over another. 
Similarly, feedback can be provided by the teacher or by other students, or even generated 
automatically by computer. 

Operationalizations of writing development: Outcome measures of the effects of 
feedback. To demonstrate the effectiveness of feedback, researchers have used measures of 
writing proficiency (e.g., Chandler, 2003; Min, 2006), as well as survey instruments designed to 
elicit student perspectives (e.g., Ferris, 1995). Writing proficiency measures that have been used 
in feedback research included ratings obtained from classroom teachers and/or trained judges 
using holistic and analytic rating scales, as well as other measures of syntactic and lexical 
complexity. Student perspectives or changes in student attitudes have been elicited using both 
qualitative approaches (e.g., interviews) and quantitative instruments (e.g., surveys). A third 
approach to analyzing the effectiveness of feedback has been to tabulate the number of suggested 
revisions that were ultimately adopted by the student in subsequent drafts (e.g., Min, 2006). 
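
The third outcome measure, the share of suggested revisions that a student adopts, reduces to a simple proportion. The sketch below shows one way it might be tabulated. The record layout, the function name, and the sample comments are hypothetical and are not drawn from any study cited here.

```python
def uptake_rate(suggestions):
    """Return the share of suggested revisions that the student adopted.

    `suggestions` is a list of dicts, each with a boolean "adopted" key.
    Returns None when no revisions were suggested, so that an empty record
    is not confused with a rate of zero.
    """
    if not suggestions:
        return None
    adopted = sum(1 for suggestion in suggestions if suggestion["adopted"])
    return adopted / len(suggestions)


# Hypothetical coding of the feedback on one student's draft.
draft_suggestions = [
    {"comment": "clarify thesis", "adopted": True},
    {"comment": "add transition", "adopted": True},
    {"comment": "fix verb tense", "adopted": False},
    {"comment": "expand conclusion", "adopted": True},
]
rate = uptake_rate(draft_suggestions)
print(rate)
```

A rate of this kind records only whether revisions were made, not whether they improved the final text.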

The coding rubric. The first major stage of a systematic research synthesis is to 
undertake an empirical analysis of the research domain, documenting the ways in which the 
central research questions have been approached within that domain. This description is also 
required to evaluate whether the research domain is sufficiently mature to permit a statistical 
meta-analysis. In the present case, our preliminary reading indicated that much of the research on 
writing for the past two decades has eschewed quantitative methods in favor of more qualitative 
approaches, especially in the area of first language composition research. While qualitative work 
adds to our collective understanding of how students develop their writing skills, such studies 
cannot be included in a quantitative meta-analysis. Thus, the ultimate goal of the analysis in this 
stage is to determine whether enough experimental studies—with clearly documented research
designs and statistical results—exist to permit the application of meta-analytic techniques. 

To accomplish the empirical analysis of the research domain, it is first necessary to 
develop a coding rubric that itemizes all important factors that vary across feedback studies, as 
well as all possible values for each of those variables. This rubric is developed inductively, by 
reading through a wide sample of research studies to identify various ways in which their
research designs could vary. 

The coding rubric developed for the present project includes 16 variables: 

• Research paradigm 

• Statistical analysis 

• Design variables 

• Target language 

• Proficiency level (for L2 studies only) 

• Number of student participants 

• Age/grade level of student participants 

• Genre of writing task(s) 

• Length of writing task(s) 

• Source of feedback 

• Mode of feedback 

• Focus of feedback 

• Tone of feedback 

• Type of feedback 

• Outcome measures 

• Specific focus for outcome measures of writing proficiency. 

These variables were used to categorize the types of research studies in this research 
domain, and thus it was necessary to develop an exhaustive list of values for each variable. These 
values and variables are shown in Table 1 (with the codes used for the meta-analysis given in 
square brackets; see Appendices B and C). 

Table 1

The Coding Rubric: Variables and Values for Each Variable

| Variables | Values | Further details |
|---|---|---|
| Research paradigm | Quantitative | Experimental; quasi-experimental; correlational; survey |
| | Qualitative | Ethnographic; case study; interviews |
| | Mixed methods | Combination of quantitative measures and qualitative description |
| | Thought piece | Theoretical argument; pedagogical primer; no original empirical data |
| Statistical analysis | Statistical tests reported | Record statistic(s) used, including descriptive statistics |
| | Statistical tests not reported | |
| Design type | Intact group(s) | |
| | Random group assignment | |
| | One group | |
| | Treatment/control [TC] | |
| | Pretest/posttest [PP] | |
| | Posttest only | |
| | Descriptive/ex-post facto | |
| Target language | L1 English [E1] | English composition studies where no mention of nonnative speaker (NNS) participants is made |
| | L2 English [E2] | Most North American L2 writing studies |
| | Mixed L1 & L2 [MX] | Comprises native speaker (NS) & NNS students |
| | L1 Other [O1] | Students whose native language is not English, learning to write in their native language (e.g., a study of the composing processes of Dutch L1 children) |
| | L2 Other [O2] | Students whose native language is English, learning to write in a second language (e.g., a study of U.S. college students learning L2 Spanish composition) |
| Proficiency level (for L2 studies only) | Proficiency level reported | [L] = Low; [H] = High |
| | Proficiency level not reported | |
| Number of student participants | N-size reported | Number of participants reported in study |
| | N-size not reported | |
| Age/grade level of student participants | Age or grade level reported | |
| | Age not reported | |
| Genre of writing task | Correspondence | Business letters; personal letters; email; memos; faxes |
| | Creative | Fiction; poetry |
| | Pedagogical [PD] | “Learning” genres, such as five-paragraph essays |
| | Personal | Diaries; journals; reflective essays |
| | Research/academic | Scientific research articles; dissertations; theses; term papers |
| | Other genres [O] | Legal writing; journalism |
| | Genre not reported | |
| Length of writing task | Length reported | |
| | Length not reported | |
| Source of feedback | Teacher [TE] | Feedback from course instructor |
| | Peer [PE] | Feedback from another student |
| | Tutor | Feedback from writing center tutor |
| | Student | Self-correction |
| | Computer | Computer-generated feedback (not just computer-mediated feedback) |
| | Other [O] | |
| | Source not reported | |
| Mode of feedback | Oral [OR] | Face-to-face conferencing; tape-recorded comments |
| | Written [WR] | Marginal comments; end comments; editing codes; circles/underlines |
| | Computer-mediated [CM] | Internet chat; email |
| | Mode not reported | |
| Focus of feedback | Grammar | Subject-verb agreement errors; tense/aspect errors; pedagogical grammar issues in L1 studies |
| | Vocabulary | Collocation errors; other word choice issues |
| | Spelling | Spelling errors |
| | Organization [O] | Topic sentence; discourse markers; transitions; paragraphing; conclusion; order of content |
| | Content [C] | Correctness of content; completeness of content |
| | Punctuation/mechanics | Comma errors; end punctuation errors; indentation; capitalization; but not spelling |
| | Other | Anything not captured by the other values for this variable |
| | Form [F] | Grammar, spelling, punctuation |
| | Content and form [C,F] | Content and form |
| | Focus not reported/specified | |
| Tone of feedback | Negative | Comments on what the student has done wrong |
| | Positive | Comments on what the student has done right |
| | Mixed | Comments on both strengths and weaknesses of text |
| | Tone not reported/specified | |
| Type of feedback | Location of error/problem/issue indicated [LO] | Location of an error is marked (circled, underlined), but no feedback is given on why it is an error or how it might be corrected |
| | Comment [CM] | Teacher/peer writes prose comments in the margin or at the end of the paper |
| | Other [O] | Other types of feedback, including direct correction/reformulation [DC], editing codes [EC], error existence [EX], metalinguistic explanation of an error [ML], spoken explicit comments [SE], spoken implicit comments [SI] |
| | Multiple [M] | Multiple types of feedback are provided, such as both location and explanatory comments |
| Outcome measures | Writing proficiency measures [WP] | Holistic ratings of writing quality, measures of spelling accuracy, grammatical accuracy |
| | Attitude measures | Likert-scale items |
| | Records of composition strategies/processes employed | Records of time spent planning, drafting, etc.; eye-tracking records |
| | Records of revisions [RV] | Number/extent of revisions made |
| | Other [O] | |
| Focus for outcome measures of writing proficiency | Grammar [GR] | |
| | Spelling [SP] | |
| | Holistic [H] | |
| | Content [C] | |
Most studies included in our analysis involved revisions made to an essay based on the
same prompt over a period of time in response to different kinds of feedback. (McGroarty and
Zhu, 1997, was exceptional in this regard because it evaluated writing development across
essays based on different prompts.) The outcome measure for most quantitative studies was a
measure of writing proficiency (either holistic quality or grammatical accuracy) based on
evaluation of the final (revised) written product. However, in a few cases, studies simply
documented the extent to which a student made any revisions, regardless of the contributions
those revisions made to the quality of the final product.

Coding the studies. The initial coding of studies for general variables, such as the 
research approach, general design type, and target population, was carried out by the second and 
third authors of the report (TN and BH). Any controversial coding decisions were discussed by 
all three authors and resolved through consensus. 

Subsequently, more detailed coding was undertaken by the second author (TN) for the 
purposes of the quantitative meta-analysis. The first step for this process was to identify the sub-set 
of studies that were suitable for inclusion in the analysis: studies that were published in the last 25 
years, used quantitative measures, had an experimental (or at least quasi-experimental) design, were 
explicit about the types of feedback that were provided, employed a clear basis for comparison, and 
included an outcome indicator that measured change in students’ writing proficiency or behavior 
(see Section 4.1 below). Any controversial coding decisions during this process were resolved 
through consensus by discussion between the first two authors (DB and TN). 

3. Empirical Survey of the Research Domain 

Based on the sampling methods described in Section 2.1, we collected a total of 306 
articles that addressed the effectiveness of feedback for writing development. Our goal here was 
to obtain an exhaustive sample of studies published in the last 30 years, resulting in a much 
larger collection of publications than in some previous meta-analyses. 

Figure 1 shows the breakdown of studies across year of publication. The trend here 
shows a dramatic increase in the number of feedback studies over the past 30 years. This trend 
reflects two factors. First is the general information explosion, with an increase in the number of 
academic journals and publications in all disciplines, and the more specific increase in the 
number of studies investigating the effect of feedback. Second, and more important for us here, 
is that this increase suggests researchers (and teachers) have shifted away from an uncritical 
belief in the effectiveness of feedback toward a recognition that feedback can take many
different shapes and that its effectiveness needs to be studied in its own right. (The last period
includes only 2.5 years, accounting for the apparent decrease in publications.)

Figure 1. Number of publications by date.

Figure 2 shows that studies in this domain have adopted the full range of research
methodologies, with both qualitative and quantitative approaches represented by a large number
of studies. In addition, numerous thought pieces (either survey articles describing previous
research on feedback or general discussion articles) are included.

Figure 3 shows that the relative preference for one or another research approach has 
remained relatively constant across time, with quantitative studies being slightly more common 
than qualitative studies. The one notable departure from this pattern is in the period 2000-2004, 
which showed a dramatic increase in the number of qualitative studies while the number of 
quantitative studies remained constant. This increase might reflect a broader paradigm shift
influenced by postmodern thinking, which values ethnographic reports of individual case
studies over reports of the general trends in a large sample of individuals. Because comparatively
few studies are included in the most recent period, it is not clear whether this trend continues. 
Figure 2. Number of research publications by research approach.

Figure 3. Number of publications from each research approach by date.

Studies have also varied in the target population that has been investigated (although
many studies do not provide full details on the subjects). For example, Figure 4 shows that 
learners of English have been the primary target of investigation, although there have also been 
numerous studies of feedback that focused on the writing development of native English 
speakers. However, there has been a shift in research focus across time, as shown in Figure 5: 
Through the 1980s, equal interest was found in the influence of feedback for both L1 and L2
learners of English (although the number of studies is comparatively few). However, by the
mid-1990s, a dramatic shift in focus had occurred, with many more studies focusing on learners
of English than on the writing development of native English speakers.



Figure 4. Number of publications by target population.

Figure 5. Number of publications focusing on L1 versus L2 learners by date.

Feedback studies that were focused on learners of English investigated the full range of 
proficiency levels, as shown in Table 2. 

Table 2

Breakdown of Studies by Proficiency Level of the Target Population
(Includes Only Studies of English Learners)

Proficiency levels       Number of studies
Low/beginner             17
Intermediate             19
Advanced                 26
Mixed                    21
Unspecified              44
Total                    122



Table 3 shows that the majority of feedback studies (whether L1 or L2) have focused on the
writing of college-aged students, while comparatively few studies have investigated the influence
of feedback for younger students.

Turning to the nature of the student writing, Table 4 shows that the overwhelming majority
of feedback studies have used pedagogical writing tasks, such as the five-paragraph essay.


Table 3

Breakdown of Studies by Age of the Target Population

Target population age    Number of studies
Ages 4-9                 8
Ages 10-12               10
Ages 13-18               15
College-age              159
Other adult ages         29
Unspecified              85
Total                    306


Table 4

Breakdown of Studies by the Genre of the Writing Task

Genre of writing task        Number of studies
Pedagogical                  187
Personal correspondence      7
Personal journal or diary    7
Creative writing             2
Research/academic writing    8
Other/unspecified            95
Total                        306
Since the central research question for this project focuses on the effectiveness of
feedback, we coded five different variables to capture the different ways in which feedback was
realized: source, mode, focus, type, and tone. As Table 5 shows, the large majority of these
studies focused on feedback given by the teacher. In this regard, feedback studies probably
reflected typical classroom practice, but they were at odds with many theoretical discussions that
advocated the utility of peer feedback.

Most feedback on student writing was communicated in writing, either using a computer 
(32 studies) or with feedback written by hand (94 studies). Many studies did not report the mode 
of feedback; only 37 studies reported giving oral feedback, and an additional 36 studies used 
multiple modes. 

Most of the studies that reported on the focus of feedback compared the influence of 
multiple categories (80 studies—usually a focus on both form and content). However, most 
studies did not report a specific focus, while only 14 studies had a single focus: 3 on content, 9 
on grammatical form, and 2 on spelling. 

Similar to the incomplete reporting typical of the other parameters, most studies in our 
sample did not report the particular type of feedback. For the remaining studies, Table 6 shows 
that written comments are the most common type of feedback, while a large number are also 
based on multiple types of feedback. 


Table 5

Breakdown of Studies by the Source of Feedback

Source of feedback       Number of studies
Computer                 18
Peer/other students      34
Self criticism           17
Peers + self             6
Teacher                  119
Teacher + other          38
Other/unspecified        74
Total                    306
Table 6

Breakdown of Studies by the Type of Feedback

Type of feedback              Number of studies
Comments                      66
Error code                    2
Direct correction of error    11
Location of error             4
Multiple types                35
Other/unspecified             188
Total                         306


Only 17 of the 306 studies in our sample reported on the tone of feedback. Of those, 15
claimed to provide both positive and negative feedback, and 2 provided only positive feedback.
It seems unlikely that this emphasis on positive feedback was equally typical for the 289 studies
that did not report on tone.

Finally, we noted in Section 2.1 above that the central research question motivating this 
research synthesis has two main components: the kinds of feedback (described in the preceding 
paragraphs) and the resulting gains in writing proficiency. Table 7 shows that there is very little 
agreement on the best way to operationalize writing proficiency or development. Of the
studies in our sample, 102 provided no specific measure of writing development. The remaining 204
studies, though, used a wide range of different measures, including questionnaires to determine 
student attitudes, direct comments on progress from teachers or students, and a record of the 
extent to which essays have been revised. Surprisingly few studies included a direct measure of 
writing quality, which might include scores for grammatical accuracy, content, organization, or 
an overall holistic rating for quality. 

In sum, our survey of research relating to feedback on writing shows that considerable 
depth exists in this research domain, with numerous studies undertaken from multiple 
perspectives. About half the studies in this research domain have been quantitative, and those 
studies have included many variants of research design. There are advantages to this diversity, in 
Table 7

Breakdown of Studies by the Outcome Measures of Writing Development

Outcomes used to measure writing development       Number of studies
Attitude measures                                  35
Revisions                                          39
Attitude measures plus revisions                   12
Composition strategies                             7
Composition strategies plus revisions              3
Essay score for quality or grammatical accuracy    40
Essay score plus attitude measures                 18
Essay score plus revisions                         18
Teacher or student comments on progress            32
Other/unspecified                                  102
Total                                              306


that each new research study considers slightly different research questions from preceding
studies. For the purposes of a quantitative meta-analysis, however, which depends on the
existence of multiple studies that are directly comparable, this diversity also presents disadvantages.

The following section turns to the methods of meta-analysis and an evaluation of this 
research domain to determine if it is suitable for this approach. 

4. Procedures II: The Quantitative Meta-Analysis 

The meta-analysis proceeded in three major steps: 

1. All publications in the larger sample were examined to identify the set of studies that 
were suitable for inclusion in a meta-analysis. 

2. Effect sizes were computed for the outcome variables in each of those studies. 

3. Mean effect sizes were computed for each treatment variable as the basis for 
determining the influence of different forms of feedback on writing development. 

We describe each of these methodological steps in turn in the following subsections. 
4.1. Identifying the Subset of Studies That Are Suitable for Meta-Analysis

During the coding of research articles described in Sections 2 and 3, we made an initial
determination of whether a study was potentially suitable for inclusion in the quantitative
meta-analysis. There were four major requirements for this initial screening (following the procedures
used in Norris & Ortega, 2000, pp. 432-33):

1. The study was published in the last 25 years (between 1982 and 2007). 

2. The study used quantitative measures and had an experimental (or at least quasi- 
experimental) design. Specifically, the study had to use and report on quantitative 
measures of effectiveness, for specific types of feedback. 

3. The independent variables measured feedback characteristics, including source of 
feedback (e.g., teacher, peer, tutor, student, computer), focus of feedback (e.g., 
grammar, vocabulary, spelling, organization, content, mechanics, rhetorical 
organization), or type of feedback (direct comment, editing code, rating, etc.). 

4. The dependent variables included an outcome indicator that measured the impact of
specific types of feedback on participants’ writing behavior, including writing
proficiency (e.g., grammatical accuracy or holistic quality rating), increase in text
length, attitude, strategies/processes employed, or number/extent of revisions made.
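
The four screening requirements above can be read as a simple conjunction of conditions. The following is a minimal sketch of that logic, not the procedure actually used by the authors; the class name, field names, and function name are all hypothetical labels introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class CodedStudy:
    # Hypothetical fields standing in for the coding decisions described above.
    year: int                          # publication year (criterion 1)
    quantitative_experimental: bool    # quantitative, (quasi-)experimental design (criterion 2)
    feedback_variable_reported: bool   # source, focus, or type of feedback coded (criterion 3)
    outcome_indicator_reported: bool   # outcome measure of writing behavior (criterion 4)


def passes_initial_screening(study: CodedStudy) -> bool:
    """Return True only if all four initial screening requirements are met."""
    return (
        1982 <= study.year <= 2007
        and study.quantitative_experimental
        and study.feedback_variable_reported
        and study.outcome_indicator_reported
    )
```

Under this sketch, a study published in 1975 would fail the first requirement regardless of its design.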

Based on these criteria, 112 studies were identified as potentially relevant for the
meta-analysis. These studies were then subjected to a second round of closer scrutiny to determine if
the design and reporting standards were in fact adequate for our purposes here. Unfortunately,
a large number of these studies had to be excluded in this second phase, for the
following reasons:

1. The research design was not suitable for inclusion. That is, following Norris and 
Ortega (2000), we included only studies with designs based on mean differences: a 
pretest/posttest design, or a control group/experimental group design. 

Several studies were excluded because they used correlational designs. Although it is 
possible to compute effect sizes from such designs, these studies addressed 
fundamentally different kinds of research questions, and so they could not be readily 
compared to the effect sizes from group-comparison studies. Twenty-four studies 
were excluded for this reason. 
2. The study addressed a different research question from the one that we are
investigating in our project (e.g., studies on whether males/females produced more 
errors or studies on whether grading rubrics are biased to favor males or females). 

Some of these studies had a pre-post test design, but no actual feedback was provided. 
Sixteen studies were excluded for this reason. 

3. The study was incomplete in its reporting of the design, sample, or statistical findings.
Specifically, to be included in the meta-analysis, the study must report one of the
following: (a) the sample size, mean scores, and standard deviations for each group,
(b) between-groups t or F values together with df, or (c) individual scores on outcome
measures for all participants. Twenty-four studies failed to meet these reporting
standards for statistical tests (e.g., reporting only significance with no df or no t value,
or reporting mean scores with no standard deviation); these studies were thus
excluded.

4. The study provided no clear basis for comparison. These were mostly studies of a 
single group that reported only posttest results. Fourteen studies were excluded for 
this reason. 

5. The study compared multiple treatment groups with respect to a single posttest with 
no pretest and no control group. For example, one group received feedback on 
content, while a second group received feedback on form; or one group received 
direct correction of errors, while only general comments were provided to a second 
group. Although these studies addressed some of the central research issues of our 
synthesis, they could not be included in the meta-analysis because it was not possible 
to isolate the influence of individual factors. As Norris and Ortega (2006) noted,
“Direct comparisons between treatment conditions are not made, because they would
be idiosyncratic to the particular study, and therefore not comparable with other
studies that did not operationalize the same two treatments” (pp. 27-28). Eleven
studies were excluded for this reason.

In sum, 89 additional research studies were excluded at this stage, leaving only 23 published 
papers (reporting on 25 different studies) that were directly comparable and otherwise suitable for 
inclusion in the meta-analysis. At this point, the large majority of studies in this research domain
were noted as not suitable for inclusion in a meta-analysis for three general reasons: 

1. Many studies in this domain were qualitative (and often anecdotal), or thought pieces, 
based on researchers’ observations and perceptions. 

2. Many of the quantitative studies were not carefully designed, or the reporting 
standards were not adequate for the purposes of meta-analysis. 

3. Several studies were carefully designed and implemented, but they simply addressed 
different research questions from the one this study focuses on. 

Thus, although we were able to identify a large number of research studies in our initial 
survey (306 studies), relatively few of these could be used in the subsequent meta-analysis (only 
25 studies). 1 
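
The exclusion counts reported in this section can be checked with a short tally. The sketch below uses only figures stated in the text; the dictionary keys are shorthand labels introduced here, not the authors' terms.

```python
potentially_relevant = 112  # studies passing the initial screening

# Exclusions in the second phase, in the order the reasons are listed above.
excluded = {
    "design not based on mean differences": 24,
    "different research question": 16,
    "incomplete reporting": 24,
    "no clear basis for comparison": 14,
    "multiple treatments, no pretest or control": 11,
}

total_excluded = sum(excluded.values())
remaining_papers = potentially_relevant - total_excluded

print(total_excluded, remaining_papers)  # prints "89 23"
```

The 23 remaining papers report on 25 studies, which is the figure used for the meta-analysis.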

4.2. Computing Effect Sizes for the Outcome Variables 

The second step in the meta-analysis was to compute an effect size for each outcome 
variable that reflects the influence of feedback. Again following Norris and Ortega (2000, 2006), 
Cohen’s d-index was selected as the most appropriate effect size estimate and calculated for each 
finding related to feedback that was reported with sufficient data. Cohen’s d represents the size 
or importance of a difference, either between a treatment group and a control group, or between a 
pretest and a posttest. (Correlational designs were not included in the final meta-analysis because
they are not comparable to group-comparison designs.) In either case, this difference is
interpreted as reflecting the influence of some treatment. Cohen’s d is essentially a kind of
standard score expressed in standard deviation units. It is calculated for a specific outcome
measure by subtracting the mean score of one group from that of the other and then dividing this
difference by the pooled standard deviation of the two groups. (Numerous reference works
provide specific formulae for computing effect sizes from different primary
statistics; see, e.g., Cohen, 1988; Lipsey & Wilson, 2001; Norris & Ortega, 2000; Rosenthal,
Rosnow, & Rubin, 2000.)

The studies included in our meta-analysis are about evenly split between studies with 
treatment-control designs and studies based on comparison of a pretest and posttest given to a 
single group (i.e., with no control group; see Section 5.1 below). Treatment-control designs 
(independent samples) are much stronger, allowing the researcher to isolate the influence of 
feedback (the treatment) apart from other factors. In contrast, pretest versus posttest (dependent
samples) designs that include only a single group are relatively weak because there is no control 
for the influence of natural development that occurs over the course of the study (see Section 5.2 
below). For this reason, our analyses in the following sections distinguish between these two 
design types to the extent that it was feasible, reporting separate mean effect sizes for each type. 
In general, the results are consistent across both treatment-control and pretest-posttest designs, 
but the results for the latter should be interpreted with caution. 

The computation of effect size also differs for the two design types (although both are 
referred to as Cohen’s d). For studies that employed a treatment-control design, we used an 
online calculator to compute effect sizes (Becker, 1999): 

http://web.uccs.edu/lbecker/Psy590/escalc3.htm

This calculator uses the following standard formula (which is consistent with Cohen’s (1988, p. 44) 
formula): 


Cohen’s $d = (M_1 - M_2) / \sigma_{\text{pooled}}$
where $\sigma_{\text{pooled}} = \sqrt{(\sigma_1^2 + \sigma_2^2) / 2}$

The reporting standards for all treatment-control studies included in our sample were high, so all
studies included the descriptive statistics required for the formula (i.e., the mean scores and
standard deviations for each group).
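
As a minimal sketch of the treatment-control computation described above, the function below takes the two group means and standard deviations and applies the pooled standard deviation as the root mean square of the two values. The function name and the example figures are made up for illustration; they do not come from any study in the sample.

```python
import math


def cohens_d_independent(mean_1, sd_1, mean_2, sd_2):
    """Cohen's d for two independent groups, pooling the SDs as sqrt((s1^2 + s2^2) / 2)."""
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd


# Hypothetical treatment group (mean 75, SD 10) versus control group (mean 70, SD 10).
d = cohens_d_independent(75.0, 10.0, 70.0, 10.0)
print(d)  # prints 0.5
```

Note that this pooling gives equal weight to both groups, so it does not use the sample sizes.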

Studies that employ dependent-sample designs (i.e., pretest/posttest designs) are more 
problematic. In theory, the original individual scores for each subject should be used to compute 
effect sizes for this type of design. That is, the appropriate formula for computing the pooled 
standard deviation for dependent samples designs is given below (Dunlap, Cortina, Vaslow, & 
Burke, 1996): 


$\sigma = \sqrt{\sum (X - M)^2 / N}$

An alternative approach, advocated by Lipsey and Wilson (2001, pp. 41-51), is to use the mean
scores and standard deviations for Time 1 and Time 2 to calculate the pooled standard deviations 
and effect sizes. 
However, none of the studies included in the present meta-analysis provided original
subjects’ scores, and most of the dependent-sample studies neglected to report the standard
deviations for Time 1 and Time 2. This situation arises frequently in meta-analyses, and one
practical solution has been to use a simplified independent-samples formula (the t score divided
by the square root of N) as an estimate of effect size for the dependent-sample designs. This
approach, which has been followed in studies like Norris and Ortega’s (2000) meta-analysis on
the effectiveness of L2 instruction, was adopted in the present study. However, it has been shown
that computing Cohen’s d from the t score and sample size results in an overestimate of the true
magnitude of the effect size (see Dunlap et al., 1996), providing an additional reason why the
results for studies with dependent-sample designs should be interpreted with caution in our
analysis. We therefore report the results for dependent-sample designs separately from the results
for true experimental designs.
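
The simplified estimate for dependent-sample designs amounts to a one-line computation. The sketch below follows the description in the text (t divided by the square root of N); the function name and example values are hypothetical, and the result carries the overestimation caveat discussed above.

```python
import math


def d_from_t(t_value, n):
    """Simplified effect size estimate: the t score divided by the square root of N."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return t_value / math.sqrt(n)


# Hypothetical pretest/posttest study reporting t = 4.0 for N = 16 students.
print(d_from_t(4.0, 16))  # prints 1.0
```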

In practical terms, a Cohen’s d of 1.0 means that the treatment group scored one standard
deviation higher than the control group (or that there was a gain of one standard deviation from
the pretest to the posttest). Thus, converting all statistical differences to standard deviation units
makes it possible to directly compare outcomes across studies.

No absolute standards are used to interpret effect sizes. The most widely accepted rule of
thumb, proposed by Cohen (1988), is based on a survey of the typical findings in social science
research: Effect sizes of d < 0.20 are interpreted as insignificant; values of d between 0.20 and
0.50 are interpreted as small effects; values of d between 0.50 and 0.80 are interpreted as medium
effects; and values of d larger than 0.80 are interpreted as large effects. However, Norris and
Ortega (2006, pp. 33-34) advocated a stricter standard, based in part on their findings in Norris
and Ortega (2000), where it seemed that effect sizes around 1.0 were more typical for L2
instructional treatment studies.

Dunlap et al. (1996) showed that effect size estimates based on correlated designs (i.e.,
pretest versus posttest designs) will systematically overestimate the true effect, unless
adjustments are applied. Specifically, they found an overestimate by a factor of 2 for studies with
a correlation of .75 for the test-retest reliabilities (the typical case; see Dunlap et al., 1996, p.
171). Applying this adjustment to the rule of thumb results in higher required effect sizes for
dependent-sample designs: d < 0.40 is interpreted as insignificant; d between 0.40 and 1.00 is
interpreted as a small effect; d between 1.00 and 1.60 is interpreted as a medium effect; and values
of d larger than 1.60 are interpreted as large effects.
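
The two sets of interpretive thresholds can be summarized in a small lookup, sketched below. Two choices here are assumptions not stated in the text: the lower boundary of each band is treated as inclusive, and the magnitude of d is used so that negative values are classified by size. The function name is hypothetical.

```python
def interpret_effect_size(d, dependent_sample=False):
    """Label an effect size using Cohen's rule of thumb, doubled for dependent-sample designs."""
    factor = 2.0 if dependent_sample else 1.0  # adjustment factor of 2 for pretest/posttest designs
    size = abs(d)  # assumption: classify by magnitude
    if size < 0.20 * factor:
        return "insignificant"
    if size < 0.50 * factor:
        return "small"
    if size < 0.80 * factor:
        return "medium"
    return "large"


print(interpret_effect_size(0.6))                         # prints "medium"
print(interpret_effect_size(0.6, dependent_sample=True))  # prints "small"
```

The same value of d therefore receives a weaker label when it comes from a pretest/posttest design.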

One major methodological issue for meta-analysis concerns whether it is appropriate to 
compute multiple effect sizes from a single publication or study. For example, many studies 
include multiple treatment groups (e.g., that receive different kinds of feedback) where each 
treatment group is compared to the same control group that received no feedback. Other studies 
use a single treatment group and a control group, but these two groups are compared with respect 
to multiple outcome measures. In cases like these, it is statistically possible to compute multiple 
effect sizes, one for each statistical comparison. But in that case, the effect sizes are not truly 
independent. Including multiple effect sizes from a study provides greater weight to that 
particular study, which could become a problem if that study was biased in some way. 

At the same time, choosing only a single comparison from a given study fails to represent 
the overall findings of the study and does not provide the basis for comparisons across different 
meta-analyses. Thus, we decided to provide a comprehensive coverage of all comparisons 
reported in these studies, at the risk of including multiple comparisons based on a single group. 

Specifically, we used the following approach: First, we computed an effect size for every 
relevant mean difference reported in these studies. In total, we computed 172 effect sizes from 
the 25 studies included in our meta-analysis, or on average about seven effect sizes per study. 
These individual effect sizes are given in Appendix B. We then analyzed the independent 
variables associated with each effect size, to determine whether they represented distinctions that
were relevant for the purpose of our meta-analysis. In cases where two effect sizes were 
associated with a single configuration of independent variables, we computed an average effect 
size. Appendix C shows the effect sizes used for our final meta-analysis. 

For example, Ashwell (2000) used a pretest-posttest design for three different groups of 
students. Each group received feedback focused on form and content. The different kinds of 
feedback were provided in different orders, but those distinctions were not relevant for the 
purposes of our meta-analysis. Each of the three groups was then evaluated for two outcome 
measures: one for grammatical accuracy and one for content. Because the distinction between 
grammar versus content outcomes is relevant for the purposes of our meta-analysis, these 
individual effect sizes were retained in the final analysis. That is, each group in the Ashwell 
study was used for two different effect sizes in the final analysis: one for a grammar outcome
measure and one for a content outcome measure. 

The analysis of the study by Berg (1999) is relatively uncontroversial: Three different 
groups received feedback, contrasted with a control group that received no feedback. The groups 
were compared for a single outcome measure. Because the three groups were independent 
samples, we retained all three effect sizes in the final analysis. 

In contrast, Bitchener, Young, and Cameron (2005) was based on two treatment groups, 
each compared with a control group for numerous outcome measures. In this case, the outcome 
measures (e.g., preposition use, tense use, article use) were all specific indicators of the same 
underlying outcome type: grammatical accuracy. In addition, each group was measured at 
different points in time. That is, all individual effect sizes for a group are instances of the same 
configuration of independent variables. As a result, 12 different effect sizes were averaged for 
each group, so that only two average effect sizes from this study were used in the final
meta-analysis.

Appendix C shows the result of this step of the analysis, displaying each of the final 
effect sizes (or average effect sizes) used in our final meta-analysis. A large number of the 
original comparisons from these studies were specific measures of the same underlying 
parameter, and they were averaged for the meta-analysis. Thus, the 172 individual effect sizes 
that we computed were reduced to 88 effect sizes used in the final meta-analysis. However, those 
88 effect sizes take into account every statistical comparison reported in the original studies. 
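
The averaging step described in this section can be sketched as a group-by operation: effect sizes that share one configuration of independent variables are collapsed to their mean, while distinct configurations are kept apart. The tuple keys and values below are invented placeholders, not entries from Appendix B, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_by_configuration(effect_sizes):
    """Collapse (configuration, d) pairs to one mean effect size per configuration."""
    grouped = defaultdict(list)
    for configuration, d in effect_sizes:
        grouped[configuration].append(d)
    return {configuration: mean(values) for configuration, values in grouped.items()}


# Hypothetical study: two grammar measures for one group share a configuration,
# while the content measure is a distinct outcome type and is kept separate.
raw = [
    (("study A", "group 1", "grammar"), 0.40),
    (("study A", "group 1", "grammar"), 0.60),
    (("study A", "group 1", "content"), 0.90),
]
averaged = average_by_configuration(raw)
print(len(averaged))  # prints 2
```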

4.3. Computing Mean Effect Sizes and Dispersion Measures 

Once the effect sizes were computed for each individual study, it was possible to 
compute mean effect sizes for the different feedback conditions. For example, it was possible to 
compare the mean improvement in writing quality (the mean effect size) for students who 
received error correction compared to students who received global comments. This comparison 
was accomplished by simply computing the arithmetic mean for all effect sizes of a given type. 

However, a simple comparison of mean effect sizes is not in itself very meaningful 
without also considering the dispersion of effect sizes around that mean. This calculation is 
required to determine the extent to which effect sizes vary across comparisons of a given type. 
For this purpose, we computed 95% confidence intervals: 
$\mathrm{CI} = d \pm \left[\, t_{(95\%,\; k-1\ \mathrm{df})} \times \left( sd / \sqrt{k} \right) \right]$


where d is the average effect size, sd is the standard deviation of all effect sizes, and k is the 
number of effect sizes (see Norris & Ortega, 2000, p. 187; Woods, Fletcher, & Hughes, 1986). 
The confidence interval provides an estimate of how well we can estimate the mean score, given 
the number of effect sizes used for the estimate and the range of variation among those effect 
sizes. The confidence interval is weighted by the t score, reflecting the fact that estimates based 
on relatively few studies are less precise than estimates based on a large number of studies. 
Smaller confidence intervals indicate that the observed effects are more robust. In particular, if 
the confidence interval does not include 0.0, it indicates that the mean effect size is significantly 
different from the null hypothesis of no effect. However, if the confidence interval includes 0.0, 
then no interpretation of a significant difference is possible. 
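
The confidence interval computation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: sd is taken to be the sample standard deviation (n - 1 denominator), and the critical t value is found numerically with the standard library (Simpson's rule plus bisection) so that the example has no external dependencies. A statistics package such as SciPy (`scipy.stats.t.ppf`) would normally be used instead. All function names and the example effect sizes are hypothetical.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x] (steps must be even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(df, level=0.95):
    """Two-sided critical t value, located by bisection on the CDF."""
    target = 1 - (1 - level) / 2
    low, high = 0.0, 100.0
    for _ in range(50):
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def confidence_interval(effect_sizes, level=0.95):
    """Return (lower, upper) bounds: mean d plus or minus t * sd / sqrt(k)."""
    k = len(effect_sizes)
    if k < 2:
        raise ValueError("at least two effect sizes are needed")
    half_width = t_critical(k - 1, level) * stdev(effect_sizes) / math.sqrt(k)
    d_mean = mean(effect_sizes)
    return d_mean - half_width, d_mean + half_width


# Hypothetical set of five effect sizes for one feedback condition.
lower, upper = confidence_interval([0.2, 0.4, 0.6, 0.8, 1.0])
print(round(lower, 2), round(upper, 2))  # roughly 0.21 and 0.99
```

Because the interval in this made-up example excludes 0.0, it would be read as a mean effect that differs from the null hypothesis of no effect.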

Using these techniques, we computed mean effect sizes and confidence intervals for the
theoretically relevant comparisons within this research domain. Many of these comparisons are 
based on small samples and so we interpreted them with caution. For comparisons based on 
fewer than five effect sizes, we simply reported the individual effect sizes rather than the 
descriptive statistics. 
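
The reporting rule described here (descriptive statistics for five or more effect sizes, individual values otherwise) can be expressed as a short Python sketch. The function name and the returned dictionary layout are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption. The first list holds the four "Other" effect sizes listed in Table 12; the second holds six individual L1-English pretest/posttest effect sizes as read from Table 15.

```python
import statistics

MIN_K = 5  # threshold stated in the text: fewer than five -> list individually


def summarize(effect_sizes):
    """Return descriptive statistics, or the raw values when k is too small."""
    k = len(effect_sizes)
    if k < MIN_K:
        return {"k": k, "individual": sorted(effect_sizes)}
    return {
        "k": k,
        "mean": statistics.mean(effect_sizes),
        "sd": statistics.stdev(effect_sizes),  # sample SD (n - 1); an assumption
    }


few = summarize([0.19, 0.31, 1.96, 2.33])
many = summarize([0.61, 0.81, 1.54, 1.80, 1.86, 2.47])
print(few)
print(f"k={many['k']} mean={many['mean']:.2f} sd={many['sd']:.2f}")
```

The second call gives a mean of about 1.5 and a sample SD of about 0.70, in line with the Teacher row for L1-English pretest/posttest designs in Table 13. This agreement is suggestive only; it does not confirm which SD formula was used in the report.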


5. Results of the Meta-Analysis 
5.1. Breakdown of Comparisons Across Study Parameters 

The studies used as the basis of the meta-analysis yielded multiple instances of 
comparisons for several of the key parameters of variation within this research domain, 
permitting further exploration of those variables. The effect sizes used for the meta-analysis were 
evenly split between pretest/posttest comparisons (44 effect sizes) and treatment/control group 
comparisons (44 effect sizes). In addition, Table 8 shows that the major parameters of variation 
among studies are sufficiently represented for further analysis. (The specific coding information 
for each of these effect sizes is given in Appendix C.) Thus, for example, multiple comparisons 
are presented for both L1-English students and for students studying a second or foreign 
language (English or other languages), including both high and low levels within the L2 group. 
Most of these studies targeted university students (74%) performing pedagogical writing tasks 
(86%). Most studies focused on the effect of teacher feedback (74%), but a moderate number 
focused on feedback from other sources (26%). Similarly, the majority of studies focused on 
written feedback, but a moderate number of studies considered oral feedback (28%). Over half of 
the studies provided feedback on both form and content (58%), but a substantial number of 
studies provided feedback only on form (27%). Most studies provided feedback in the form of 
comments, often together with locating specific errors. Almost all of the studies included in the 
final meta-analysis used outcome measures of writing proficiency (as opposed to the number of 
revisions made or changes in attitudes). Beyond that, studies focused on three major areas of 
improvement: grammatical form (usually number of errors), content, or an overall holistic rating. 

Table 8 

Breakdown of the Specific Comparisons Used to Compute Outcome Effect Sizes 

Independent variable                        Number of effect sizes 

Language background 
  Pretest versus posttest designs 
    L1-English                              8 
    L2-English                              23 
    L2-other than English                   7 
    Mixed                                   6 
    Subtotal                                44 
  Treatment versus control designs 
    L1-English                              12 
    L2-English                              24 
    L2-other than English                   7 
    Mixed                                   1 
    Subtotal                                44 

L2 proficiency level 
  Pretest versus posttest designs 
    Low/beginner                            16 
    High/advanced                           17 
    Subtotal                                33 
  Treatment versus control designs 
    Low/beginner                            12 
    High/advanced                           16 
    Subtotal                                28 

Feedback source 
  Pretest versus posttest designs 
    Teacher                                 40 
    Other (peers or computer)               4 
    Subtotal                                44 
  Treatment versus control designs 
    Teacher                                 25 
    Other (peers or computer)               19 
    Subtotal                                44 

Feedback mode 
  Pretest versus posttest designs 
    Written                                 30 
    Oral                                    12 
    Missing                                 2 
    Subtotal                                44 
  Treatment versus control designs 
    Written                                 30 
    Oral                                    13 
    Missing                                 1 
    Subtotal                                44 

Feedback focus 
  Pretest versus posttest designs 
    Form                                    16 
    Content                                 2 
    Form and content                        24 
    Awareness of revision process           0 
    Missing                                 2 
    Subtotal                                44 
  Treatment versus control designs 
    Form                                    8 
    Content                                 2 
    Form and content                        27 
    Awareness of revision process           3 
    Missing                                 4 
    Subtotal                                44 

Feedback type 
  Pretest versus posttest designs 
    Comments                                16 
    Error location                          10 
    Comments + location                     16 
    Other                                   2 
    Subtotal                                44 
  Treatment versus control designs 
    Comments                                26 
    Error location                          4 
    Comments + location                     8 
    Other                                   6 
    Subtotal                                44 

Outcome measure of writing development 
  Pretest versus posttest designs 
    Proficiency measure                     44 
    Revisions                               0 
    Other                                   0 
    Subtotal                                44 
  Treatment versus control designs 
    Proficiency measure                     40 
    Revisions                               3 
    Other                                   1 
    Subtotal                                44 

Specific focus of outcome measure 
  Pretest versus posttest designs 
    Grammar/form                            18 
    Content                                 8 
    Holistic rating of quality              18 
    Spelling                                0 
    Subtotal                                44 
  Treatment versus control designs 
    Grammar/form                            18 
    Content                                 11 
    Holistic rating of quality              9 
    Spelling                                5 
    Not reported                            1 
    Subtotal                                44 
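
Several of the percentages quoted in Section 5.1 can be recovered from the counts in Table 8. The short Python sketch below is only an arithmetic check; it assumes that each percentage is taken over all 88 effect sizes (both design types combined) rather than over individual studies, and the labels are chosen here for illustration.

```python
TOTAL = 88  # 44 pretest/posttest + 44 treatment/control effect sizes

# Counts from Table 8, summed over the two design types.
counts = {
    "teacher feedback": 40 + 25,
    "oral feedback": 12 + 13,
    "form and content": 24 + 27,
    "form only": 16 + 8,
}


def percent(n, total=TOTAL):
    """Share of all effect sizes, rounded to the nearest whole percent."""
    return round(100 * n / total)


shares = {label: percent(n) for label, n in counts.items()}
print(shares)
# {'teacher feedback': 74, 'oral feedback': 28, 'form and content': 58, 'form only': 27}
```

Under that assumption, the results agree with the 74%, 28%, 58%, and 27% figures given in the text.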


5.2. The Influence of Design Type 

With this background, it was possible to compute mean effect sizes for all variables that 
were theoretically relevant and represented by an adequate number of comparisons in our pool of 
research studies. For example, Table 9 reports the mean effect sizes for the two types of research 
designs included here, showing that pretest versus posttest comparisons report larger gains than 
studies that compare treatment groups to control groups. 


Table 9 

Mean Effect Sizes for Research Design Types 

Research design type            k     Mean (d)   SD (d)   95% confidence interval 
Pretest vs. posttest            44    .98        .92      .70 to 1.26 
Treatment vs. control groups    44    .53        .82      .28 to .78 

The greater gains found in pretest versus posttest designs have two major sources. First, they 
can be attributed in part to the inflation in estimated effect size for dependent-sample designs 
that results from using the t score divided by the square root of N, rather than the more accurate 
formula based on the standard deviations for Time 1 and Time 2. 
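
The source of this inflation can be illustrated with a small Python sketch. The numbers are hypothetical and are not taken from any study in the meta-analysis, and the two functions are assumed forms of the estimators (a dependent-samples t statistic converted with t / sqrt(N), and a mean gain standardized by the pooled Time 1 and Time 2 standard deviations); the report's own formulas are defined in its methods section. When scores at the two times are strongly correlated, the standard deviation of the gain scores is smaller than the pooled standard deviation, so the t-based estimate is larger.

```python
import math


def d_from_paired_t(mean_gain, sd1, sd2, r, n):
    """d estimated as t / sqrt(N) for a dependent-samples t test."""
    sd_diff = math.sqrt(sd1 ** 2 + sd2 ** 2 - 2 * r * sd1 * sd2)
    t = mean_gain / (sd_diff / math.sqrt(n))
    return t / math.sqrt(n)  # algebraically equal to mean_gain / sd_diff


def d_from_sds(mean_gain, sd1, sd2):
    """d standardized by the pooled Time 1 / Time 2 standard deviation."""
    pooled = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return mean_gain / pooled


# Hypothetical values: mean gain of 5 points, SD of 10 at both times,
# pretest-posttest correlation of .8, 30 students.
gain, sd1, sd2, r, n = 5.0, 10.0, 10.0, 0.8, 30
print(round(d_from_paired_t(gain, sd1, sd2, r, n), 2))  # 0.79
print(round(d_from_sds(gain, sd1, sd2), 2))             # 0.5
```

In this sketch the two estimates coincide when the correlation is .5 (with equal standard deviations) and diverge as the correlation rises.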

Second, the difference can also be attributed to the natural development in writing 
proficiency that comes with time. Further, since these pretest versus posttest designs usually 
consist of students working on multiple drafts of the same essay, the gains reflect the natural 
process of improvement that results from revision. Thus, the gains in performance in pre- versus 
posttest comparisons are influenced both by the feedback treatment and by the natural 
development that occurs over time in association with the revision process. This interpretation is 
further supported by two of these studies that reported pre-post test results for a control group 
that received no feedback. Thus, Hillocks (1982) reported an effect size of .82 for writing quality 
improvement with no feedback, and Brakel-Olson (1990) reported an effect size of 1.04 for 
holistic writing improvement with no feedback. These two comparisons were excluded from the 
meta-analysis, because no feedback was provided to students. However, they indicate that the 
particular pre-post test designs included in this study, many of which lacked control groups, can 
tell relatively little about the influence of feedback, because that treatment is confounded with 
the natural development that comes with time. 

In contrast, the influence of natural development is accounted for in treatment-control 
group designs, because both groups practice writing for the same amount of time. We still see 
positive gains in writing proficiency in treatment-control designs, but the effect sizes are more 
modest. This latter finding answers the overall question: Does feedback on student writing, 
considered on its own, result in gains in writing development? The answer is yes, but the gains 
overall are not strong. 

Because treatment-control design studies have greater experimental validity than pre-post 
design studies that include only a single treatment group (with no control group), we have given 
greater weight to the findings from those studies in our meta-analysis. Thus, for each of the 
following analyses, we report two sets of findings: 1) the mean effect sizes for the pretest- 
posttest designs and 2) the mean effect sizes for the experimental treatment-control designs. 
Further, when possible we present separate findings for L1-English students and for L2 students. 
However, some comparisons are based on too few effect sizes to permit this level of detail. 


5.3. L1-English Versus L2 Groups 

Table 10 shows that learners of a second/foreign language made larger gains associated 
with feedback than native speakers of English (in studies with treatment-control designs). This 
can in part be explained by the different outcome measures used in L1-English versus L2 studies: 
The former measured improvement in content scores or in holistic measures of writing quality, 
while the latter often measured improvement in grammatical accuracy (see Table 22 in Section 
5.6). In addition, this difference might relate to proficiency level, since L1-English writers 
presumably are at a higher proficiency level than L2 students. That possibility is supported by 
the findings reported for the L2 group in Table 11, which shows that low-level students achieve 
larger gains in writing development than more advanced students. Apparently larger gains in 
proficiency are possible for low-level groups, simply because they need to learn so much. This 
pattern holds for low-proficiency compared to high-proficiency L2 students and apparently also 
holds for L2 students compared to L1 students. (There is little difference if we consider only the 
studies that employed pretest-posttest designs, shown especially by the overlaps in the 
confidence intervals.) 
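
The informal overlap check mentioned here can be sketched in Python using the 95% confidence intervals reported in Table 10. The function name is hypothetical, and interval overlap is only a rough screen, not a formal test of the difference between two means.

```python
def overlaps(ci_a, ci_b):
    """True when two (low, high) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# 95% confidence intervals copied from Table 10.
pre_post = {
    "L1-English": (0.50, 1.89),
    "L2-English": (0.56, 1.28),
    "L2-other": (0.33, 2.72),
}
treatment_control = {
    "L1-English": (-0.43, 0.37),
    "L2-English": (0.30, 1.01),
    "L2-other": (0.58, 1.60),
}

for design, table in [("pretest/posttest", pre_post),
                      ("treatment/control", treatment_control)]:
    for group in ("L2-English", "L2-other"):
        print(design, group, overlaps(table["L1-English"], table[group]))
```

For the pretest/posttest designs, both L2 intervals overlap the L1-English interval. For the treatment/control designs, the L2-English interval overlaps the L1-English interval only marginally, and the L2-other interval does not overlap it at all.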

It was not possible to compare gains across age groups or across writing tasks, because 
many studies did not specify these characteristics, and for those that did, most studies relied on a 
single age group (university students) and a single task type (the pedagogical essay). 


Table 10 

Mean Effect Sizes for Each Language Group 

Language group        k     Mean (d)   SD (d)   95% confidence interval 

Pretest/posttest design 
  L1-English          8     1.20       .83      .50 to 1.89 
  L2-English          23    .92        .84      .56 to 1.28 
  L2-other            7     1.53       1.29     .33 to 2.72 

Treatment/control design 
  L1-English          12    -0.03      .63      -.43 to .37 
  L2-English          24    0.66       .84      .30 to 1.01 
  L2-other            7     1.09       .55      .58 to 1.60 



Table 11 

Mean Effect Sizes for Language Proficiency Levels (L2 Students Only) 

Language proficiency level    k     Mean (d)   SD (d)   95% confidence interval 

Pretest/posttest design 
  Low/beginning               16    1.35       1.03     .80 to 1.90 
  High/advanced               17    1.03       .75      .64 to 1.42 

Treatment/control design 
  Low/beginning               12    0.98       .94      .39 to 1.58 
  High/advanced               16    0.46       .87      .00 to .93 


5.4. Source and Mode of Feedback 

Turning to the different ways in which feedback was provided, we find several interesting 
patterns. On first consideration, there appears to be little overall difference between feedback 
provided by the teacher and feedback provided from other sources, as shown in Table 12. 

However, further exploration of this general pattern shows that it hides a relatively large 
difference between L1-English and L2-English learners: As Table 13 shows, L1-English writers 
had much larger gains resulting from teacher feedback than from other feedback (peer or 
computer). In contrast, L2-English writers showed exactly the opposite trend, with much larger 
gains resulting from other feedback. (The same trends are shown for both design types, although 
some of the comparisons are based on only a few effect sizes.) 

Table 12 

Mean Effect Sizes for Different Sources of Feedback 

Source        k     Mean (d)                 SD (d)   95% confidence interval 

Pretest/posttest design 
  Teacher     40    .96                      .91      .67 to 1.25 
  Other       4     .19, .31, 1.96, 2.33     n/a      n/a 

Treatment/control design 
  Teacher     25    .53                      .60      .28 to .78 
  Other       19    .52                      1.06     .01 to 1.04 


Table 13 

Mean Effect Sizes for Different Sources of Feedback: L1 English Versus L2 English 

Feedback source    k     Mean (d)       SD (d)   95% confidence interval 

Pretest/posttest design for L1 English 
  Teacher          6     1.52           .70      .78 to 2.25 
  Other            2     .19, .31       n/a      n/a 

Treatment/control design for L1 English 
  Teacher          2     .53, .53       n/a      n/a 
  Other            10    -.14           .63      -.59 to .31 

Pretest/posttest design for L2 English 
  Teacher          21    .80            .78      .45 to 1.16 
  Other            2     1.96, 2.33     n/a      n/a 

Treatment/control design for L2 English 
  Teacher          16    .28            .50      .02 to .55 
  Other            8     1.41           .92      .63 to 2.18 


Surprisingly, oral feedback seems to have been more effective in these studies than 
written feedback, as shown in Table 14. (Again the same trends are shown for both design 
types.) However, similar to the findings on source of feedback, Table 15 shows that L1-English 
students differ from L2-English students in their preferred mode of feedback. Although based on 
only a few effect sizes, Table 15 shows that L1-English students achieved strong gains in writing 
proficiency resulting from oral feedback, contrasted with no or small gains resulting from written 
feedback. In contrast, L2-English students achieved moderately strong gains in writing 
proficiency following both oral and written feedback. These findings are based on relatively few 
effect sizes and so must be interpreted with caution. However, coupled with the findings on 
preferred source of feedback (Tables 12 and 13), they suggest an interesting difference in the 
typical learning styles of L1-English versus L2-English students. 


Table 14 

Mean Effect Sizes for Different Modes of Delivery of Feedback 

Feedback mode    k     Mean (d)   SD (d)   95% confidence interval 

Pretest/posttest design 
  Written        30    .68        .70      .42 to .94 
  Oral           12    1.86       .91      1.29 to 2.44 

Treatment/control design 
  Written        30    .40        .91      .07 to .74 
  Oral           13    .84        .52      .52 to 1.15 


Table 15 

Mean Effect Sizes for Feedback Modes of Delivery: L1 English Versus L2 English 

Feedback mode    k     Mean (d)                  SD (d)   95% confidence interval 

Pretest/posttest design for L1 English 
  Written        2     .61, .81                  n/a      n/a 
  Oral           4     1.54, 1.80, 1.86, 2.47    n/a      n/a 

Treatment/control design for L1 English 
  Written        10    -.14                      .63      -.59 to .31 
  Oral           2     .53, .53                  n/a      n/a 

Pretest/posttest design for L2 English 
  Written        21    .80                       .78      .45 to 1.16 
  Oral           2     1.96, 2.33                n/a      n/a 

Treatment/control design for L2 English 
  Written        19    .69                       .94      .24 to 1.14 
  Oral           5     .53                       .35      .10 to .97 


5.5. The Focus and Type of Feedback 

Most studies in our sample provided feedback that focused on both content and form, but 
some studies focused strictly on one or the other. However, as Table 16 shows, feedback that 
focuses purely on form is less effective than feedback that focuses on content plus form. This 
finding seems to support the claim that writing tasks and feedback should be meaningful for 
students, with tasks that focus on the communication of information. Such communicative tasks 
are apparently very effective when coupled with a focus on form, while an exclusive focus on 
form (with no attention to content) is considerably less effective. 

Table 16 

Mean Effect Sizes for the Different Focuses of Feedback 

Focus               k     Mean (d)      SD (d)   95% confidence interval 

Pretest/posttest design 
  Content           2     .09, 1.22     n/a      n/a 
  Form              16    .56           .55      .27 to .85 
  Content + form    24    1.20          1.02     .76 to 1.63 

Treatment/control design 
  Content           2     .10, 1.12     n/a      n/a 
  Form              8     .08           .44      -.28 to .45 
  Content + form    27    .48           .87      .14 to .82 


This same pattern holds when we consider only L2-English learners. The results for 
treatment-control designs in Table 17 show that the greatest gains in writing development for 
L2-English writers were made with feedback that focused on content and form, while feedback 
focused on form resulted in no significant gains for this group. (The results for pretest-posttest 
designs show little difference here, especially when the overlap in confidence intervals is 
considered.) 


Table 17 

Mean Effect Sizes for the Different Focuses of Feedback (L2 English Only) 

Focus               k     Mean (d)      SD (d)   95% confidence interval 

Pretest/posttest design 
  Content           2     .09, 1.22     n/a      n/a 
  Form              7     .91           .65      .31 to 1.51 
  Content + form    12    .77           .90      .20 to 1.34 

Treatment/control design 
  Content           2     .10, 1.12     n/a      n/a 
  Form              7     .03           .44      -.38 to .44 
  Content + form    10    .71           .88      .08 to 1.34 


The findings on the different types of feedback, shown in Table 18, are consistent with 
the findings on feedback focus: commenting results in greater gains than error location. 

These findings were somewhat difficult to interpret because many studies were vague in 
their descriptions of feedback type. It was often unclear what kinds of feedback were provided as 
comments, and many studies were mixed in that they both identified the location of some errors 
and provided comments in the margins and at the end of the paper. In addition, eight studies 
provided feedback in the form of general training (e.g., on the revision process) rather than 
specific feedback on a writing sample. Thus, Table 18 indicates that commenting is the most 
effective type of feedback, but the differences are small in the treatment-control studies. 


Table 18 

Mean Effect Sizes for the Different Types of Feedback 

Type                k     Mean (d)                SD (d)   95% confidence interval 

Pretest/posttest design 
  Comments          16    1.57                    .96      1.06 to 2.08 
  Error location    10    .56                     .92      -.10 to 1.22 
  Mixed             16    .75                     .60      .43 to 1.08 
  Other             2     .19, .31                n/a      n/a 

Treatment/control design 
  Comments          26    .40                     .84      .06 to .74 
  Error location    4     .05, .28, .46, 2.28     n/a      n/a 
  Mixed             7     .18                     .21      -.02 to .37 
  Other             6     1.23                    .80      .40 to 2.07 


5.6. Comparing Different Types of Outcome Measures: The Different Ways in Which 
Writing Proficiency Can Develop 

Finally, we can ask whether all aspects of writing development can be improved through 
feedback. This question can be investigated by comparing the effect sizes for the different kinds 
of outcome measures. Table 19 shows the comparison between the two general outcome types 
represented in these studies: scores of writing proficiency (reflecting accuracy or quality) versus 
a measure of the extent to which writing had been revised across multiple drafts. Gains were 
reported for both kinds of outcome measures, although those gains were much larger for 
measures of revising (regardless of quality or accuracy). Although based on only three effect 
sizes, this finding seems uncontroversial—and uninteresting: Students will make more revisions 
when they are given feedback that tells them that they should make revisions. 


Table 19 

Mean Effect Sizes for the Different Outcome Measures of Writing Development 

Outcome measure          k     Mean (d)             SD (d)   95% confidence interval 

Pretest/posttest design 
  Writing proficiency    44    .98                  .92      .70 to 1.26 
  Revisions              1     .53                  n/a      n/a 

Treatment/control design 
  Writing proficiency    40    .42                  .77      .18 to .67 
  Revisions              3     1.51, 1.90, 2.25     n/a      n/a 




More insightful analyses are possible by considering only studies of improvement in 
writing proficiency, comparing the specific focuses of the outcome measure (e.g., a focus on 
grammar/form, spelling, content, or overall holistic quality). As Table 20 shows, gains were 
reported for most outcome measures. For measures of content, those gains were small and not 
significant (shown by the confidence interval including 0.0), but gains were larger for outcome 
measures focused on grammar and holistic measures of writing quality (which presumably 
reflects both form and content). (The gains for holistic quality were smaller for treatment-control 
designs.) In contrast, outcome measures of spelling accuracy actually showed a decrease 
following feedback. 
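
The reading of Table 20 described above follows a simple rule: an interval entirely above zero indicates a significant gain, an interval entirely below zero a significant decrease, and an interval containing zero no significant effect. The Python sketch below applies that rule to the treatment/control intervals from Table 20; the function name and result labels are illustrative choices.

```python
def interpret(ci_low, ci_high):
    """Classify a 95% confidence interval relative to the null value of 0.0."""
    if ci_low > 0:
        return "significant gain"
    if ci_high < 0:
        return "significant decrease"
    return "not significant"


# 95% confidence intervals for treatment/control designs, copied from Table 20.
table_20_treatment_control = {
    "Grammar/form": (0.41, 1.18),
    "Content": (-0.09, 1.25),
    "Holistic rating of quality": (0.20, 0.81),
    "Spelling": (-0.95, -0.12),
}

results = {focus: interpret(*ci) for focus, ci in table_20_treatment_control.items()}
for focus, verdict in results.items():
    print(f"{focus}: {verdict}")
```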

When the language groups are distinguished, as in Table 21, we see that L2 students 
made gains in grammar/form and overall quality (and nonsignificant gains in content), but that 
the gains for L1 students were restricted to holistic ratings of overall quality. 


Table 20 

Mean Effect Sizes for the Different Focuses of Outcome Measures 

Outcome focus                   k     Mean (d)   SD (d)   95% confidence interval 

Pretest/posttest design 
  Grammar/form                  18    1.12       1.08     .58 to 1.66 
  Content                       8     .40        .60      -.09 to .90 
  Holistic rating of quality    18    1.10       .79      .71 to 1.49 

Treatment/control design 
  Grammar/form                  18    .80        .78      .41 to 1.18 
  Content                       11    .58        1.00     -.09 to 1.25 
  Holistic rating of quality    9     .51        .40      .20 to .81 
  Spelling                      5     -.53       .33      -.95 to -.12 


Table 21 

Mean Effect Sizes for Different Focuses of Outcome Measures: L1 English Versus L2 English 

Outcome focus                   k     Mean (d)   SD (d)   95% confidence interval 

Pretest/posttest design for L1 English 
  Holistic rating of quality    8     1.20       .83      .50 to 1.89 

Treatment/control design for L1 English 
  Content                       5     .24        .64      -.55 to 1.03 
  Spelling                      5     -.53       .33      -.95 to -.12 

Pretest/posttest design for L2 English 
  Holistic rating of quality    8     1.23       .74      .61 to 1.85 
  Content                       6     .51        .67      -.20 to 1.21 
  Grammar/form                  16    1.19       1.13     .59 to 1.79 

Treatment/control design for L2 English 
  Holistic rating of quality    7     .56        .43      .16 to .96 
  Content                       6     .86        1.22     -.42 to 2.14 
  Grammar/form                  18    .80        .78      .41 to 1.18 


5.7. Are Particular Kinds of Feedback Associated With Particular Gains in Writing 
Development? 

The obvious question at this point is whether the particular type or focus of feedback 
results in specialized gains in writing development, reflected by the focus of the outcome 
measure. Table 22 shows the influence of feedback focus on different aspects of writing 
development. 

The findings are not encouraging for the role of feedback in improving content ratings 
(considered only in treatment-control designs): Apparently students do not become more 
informative, logical, or elaborated in their prose as a result of feedback, regardless of the focus of 
that feedback. Thus, regardless of the focus of feedback, students made no significant gains in 
content scores. 

Table 22 

Mean Effect Sizes for Different Outcome Focuses, Depending on the Feedback Focus 

Outcome focus                              k     Mean (d)       SD (d)   95% confidence interval 

Pretest/posttest design 
  Holistic quality rating 
    Form feedback                          8     .48            .41      .14 to .82 
    Content + form feedback                8     1.46           .65      .92 to 2.00 
    General feedback (unspecified focus)   2     1.96, 2.33     n/a      n/a 
  Grammar/form accuracy rating 
    Content + form feedback                12    1.36           1.18     .61 to 2.10 
    Form feedback                          5     .77            .77      -.19 to 1.73 

Treatment/control design 
  Content rating 
    Content + form feedback                6     .25            .57      -.34 to .85 
  Grammar/form accuracy rating 
    Content + form feedback                13    1.02           .81      .53 to 1.51 
    Form feedback                          5     .22            .19      -.03 to .46 

The overall holistic quality of an essay is probably the least informative of the outcome 
measures because it is impossible to identify the specific aspects of writing development that 
have improved. However, this outcome measure is also the most popular in pretest-posttest 
studies. Table 22 suggests that it is possible to achieve large gains in the holistic quality rating, 
resulting either from comments on content and form or from general feedback with an 
unspecified focus (which presumably includes comments). In contrast, feedback on form results 
in only small gains in these holistic quality scores. 

Table 22 further shows that it is possible to improve grammatical accuracy. But the 
relative importance of the predictor variables is surprising here: Feedback focused on a 
combination of form and content results in a much greater improvement of grammatical accuracy 
than feedback that focuses exclusively on form. This difference is found for both design types. 
This finding suggests that student writing for real-world purposes, with the goal of 
communicating particular content, enables and encourages students to achieve greater gains in 
writing development than artificial writing tasks that are focused primarily on grammatical 
accuracy. 

Finally, we can consider this same general question in relation to the different types of 
feedback. Although some of the comparisons are based on very few effect sizes, Table 23 
suggests that specific feedback of any type is not very helpful for improving content ratings. In 
contrast, general training in the revision process does seem to help students improve the content 
of their papers. The results from pretest-posttest studies indicated that holistic quality ratings can 
be improved considerably by feedback in the form of comments, while specific comments tied to 
a particular location in the text provide less benefit for improvement in holistic quality. 

The most interesting pattern in Table 23 has to do with the grammatical accuracy ratings, 
where the greatest improvements are associated with feedback in the form of comments. In 
contrast, error location feedback (either with or without more detailed comments) results in only 
small gains. Both design types show the same trend here. This finding applies only to L2 students 
(see Table 24), because none of these studies investigated grammatical accuracy for L1-English 
students. The patterns for gains in grammatical accuracy shown in Tables 22-24 are interesting 
because they suggest that students benefit more from general explanations of a grammatical 
phenomenon than from identification of specific errors. In fact, error identification seems to 
detract from the benefit of commenting: Students made smaller gains when feedback included 
error identification, even if comments supplemented the identification of errors. Apparently, 
explanations of error patterns are more helpful than identifying selected specific errors. 

Table 23 

Mean Effect Sizes for Different Outcome Focuses, Depending on the Feedback Type 

Outcome focus                        k     Mean (d)             SD (d)   95% confidence interval 

Pretest/posttest design 
  Content rating 
    Feedback with error location     7     .29                  .54      -.21 to .78 
  Holistic quality rating 
    Feedback as comments             8     1.67                 .66      1.12 to 2.23 
    Feedback with error location     10    .64                  .55      .25 to 1.04 
  Grammar/form accuracy rating 
    Feedback as comments             7     1.50                 1.33     .27 to 2.73 
    Feedback with error location     11    .88                  .88      .29 to 1.47 

Treatment/control design 
  Content rating 
    Feedback as comments             8     .09                  .63      -.44 to .62 
    Training on revision process     3     1.51, 1.90, 2.25     n/a      n/a 
  Grammar/form accuracy rating 
    Feedback as comments             8     1.18                 .73      .57 to 1.79 
    Feedback with error location     10    .49                  .70      -.02 to 1.32 


Table 24 

Mean Effect Sizes for Different Outcome Focuses, Depending on the Feedback Type (L2 
Students Only) 

Outcome focus                        k     Mean (d)   SD (d)   95% confidence interval 

Pretest/posttest design 
  Grammar/form accuracy rating 
    Feedback as comments             7     1.50       1.33     .27 to 2.73 
    Feedback with error location     9     .95        .96      .21 to 1.69 

Treatment/control design 
  Grammar/form accuracy rating 
    Feedback as comments             8     1.18       .73      .57 to 1.79 
    Feedback with error location     10    .49        .70      -.02 to .99 


6. Discussion and Implications for TOEFL 

Several important general patterns emerge from this synthesis of research on the 
effectiveness of feedback for individual writing development: 

1. Interest in this research question is widespread: More than 300 studies have been 
published on this topic in the last 25 years. 

2. But the large majority of studies in this research domain are not suitable for inclusion 
in a meta-analysis: Less than 10% of the studies in our sample were found to be 
suitable. There were three general reasons for this: 

2a. Many studies in this domain were qualitative, or thought pieces. 

2b. Many of the quantitative studies in this domain were not carefully designed, or the 
reporting standards were not adequate for the purposes of meta-analysis. 

2c. Some quantitative studies addressed different research questions from the one that 
we are focusing on here. 

3. Both L1-English and L2-English students make gains in writing development in 
response to feedback. 

4. But students at lower proficiency levels make greater gains in writing development in 
response to feedback than students at high proficiency levels: L2 students make greater 
gains than L1 students, and low-proficiency L2 students make greater gains than 
high-proficiency L2 students. 

5. Large differences exist in how L1-English students and L2-English students respond 
to feedback from different sources and in different modes. 

5a. The greatest gains for L1-English students are achieved in response to teacher 
feedback presented orally. 

5b. The greatest gains for L2-English students are achieved in response to other 
feedback, including feedback from other students and feedback from computers. 

6. A combined focus on content + form results in greater gains in writing development 
than an exclusive focus on form. 

6a. If we consider only L2-English learners, the differential influence of a focus on 
content + form versus form becomes even greater. 

7. Larger gains in writing development result from feedback that is expressed through 
comments than from locating/correcting errors. 

7a. These patterns are stronger for L2-English students, with moderately large gains 
resulting from comments, versus smaller or insignificant gains from error location 
feedback, and only small gains resulting from other feedback (including direct 
error correction and training in the revision process). 

8. It is apparently difficult to provide specific feedback that improves the content of 
student writing. 

8a. Providing specific feedback on previous writing samples, whether through 
comments or through identifying specific trouble spots in the paper, results in 
only small improvements in content scores. 

8b. In contrast, providing training in the revision process results in large gains in 
content scores (based on only three effect sizes). 

9. Providing feedback expressed through comments with a combined focus on content + 
form improves holistic quality ratings. 

9a. Error location or an exclusive focus on form results in only small gains for holistic 
quality. 

10. Grammatical accuracy can be best improved by feedback that focuses on a 
combination of form and content. 

10a. Feedback that focuses exclusively on form does not result in a significant 
improvement in grammatical accuracy. 

10b. Feedback provided as written comments results in large gains in grammatical 
accuracy. 

10c. Feedback as error location results in smaller or no gains in grammatical 
accuracy. 

Some of these findings are surprising, running counter to our prior expectations. For 
example, a widely held perception is that teachers have more authority and prestige in many 
non-Western cultures than in American and British society. Because of that, we expected that 
teacher feedback would be more influential for L2-English students than for L1-English students. 
Previous studies on peer editing have found that L1-English students are receptive to this 
approach and make modest gains (Graham & Perin, 2007). In contrast, previous studies of 
L2-English students noted skepticism about the value of feedback from other students and found 
that greater gains are made from teacher feedback than from peer feedback (see, e.g., Tsui & Ng, 2000). 

Thus, we expected that teacher feedback would be more influential than peer or computer 
feedback for L2-English students. However, those predictions were not supported by the 
meta-analysis. Rather, L1-English students showed strong gains from teacher feedback and no gains 
from peer feedback, while L2-English students showed greater gains from peer or computer 
feedback than from teacher feedback. The findings on mode of delivery are also surprising: 
L1-English students showed strong gains from oral feedback and no overall gains from written 
feedback, while L2-English students showed moderately strong gains from both oral and written 
feedback. 

These findings further indicate that L2-English students are more adaptive than 
L1-English students. That is, L2-English students make gains in writing proficiency regardless of 
how feedback is presented: oral or written, from teachers or peers. This finding could in part be a 
reflection of their lower proficiency status, but it also seems to reflect their ability to benefit from 
many different types of feedback. In contrast, L1-English students are quite polarized in terms of 
the kinds of feedback that are effective: Feedback from the teacher, presented orally, is effective, 
whereas peer feedback and feedback presented in writing actually resulted in a loss of writing 
proficiency for L1 writers in these studies. 

Similarly, the findings on the focus of feedback are noteworthy: A combined focus on 
content + form is generally much more effective than an exclusive focus on form for writing 
development. This trend is stronger for L2-English students than for L1-English students. 

This pattern holds for outcomes that measure the overall holistic quality of student 
writing. But even for the improvement of grammatical accuracy, combined feedback on content 
plus form is more effective than feedback focused exclusively on form, supporting the general 
claims of Truscott (2007). (Unfortunately, no study in our sample investigated the influence of 
feedback focused exclusively on content for the development of grammatical accuracy.) These 
patterns seem to support the general approach advocated by proponents of content-based 
instruction (CBI), showing that real-world tasks with a focus on communicating actual content 
are more effective learning environments than tasks focused exclusively on grammatical form. 

A similar trend seems to be at work for outcomes that measure improvement in content 
scores. In this case, specific feedback provided on previous written papers did not result in a 
significant improvement in content scores. In contrast, training in the revision process itself 
resulted in large improvements in content scores. This finding could have the same explanation 
as above: Helping students learn how to revise, with a focus on effective communication rather 
than a specific written product, results in the largest gains in proficiency (at least for the content 
of writing). 

In contrast, specific feedback provided on papers was effective for improvements in 
holistic quality and in grammatical accuracy. In these cases, it was surprising that the type of 
feedback seemed as important as the focus of feedback. It was predictable that feedback in 
the form of written comments would result in greater gains in overall holistic quality than 
feedback that identified the location of errors. However, the surprising findings here have to do 
with the improvement of grammatical accuracy: Feedback provided through written comments 
was found to be more effective for improving grammatical accuracy than error location. This 
finding suggests that students benefit the most from descriptions and/or explanations of their 
grammatical patterns. In contrast, students might regard direct error identification as simple 
editing corrections, and so they seem less likely to generalize from those corrections to other 
instances of similar constructions. Here again, these findings are consistent with Truscott’s 
(2007) findings that students improve little in their grammatical accuracy based on direct error 
correction. 

All of the above conclusions should be treated with caution, because the meta-analysis 
has major limitations. First of all, this research domain was not very “mature” when evaluated 
for the purposes of meta-analysis. Although we were able to identify 306 published research 
studies that investigated the effectiveness of feedback on student writing, only 25 of those 
studies (less than 8%) proved to be suitable for inclusion in the statistical meta-analysis. Because 
a sample size of 25 is too small to permit comparisons for many of the parameters of interest, we 
permitted the inclusion of multiple effect sizes from the individual studies. But this decision 
introduced the risk that a single aberrant study (e.g., with a flawed design or methods) might 
have a relatively large influence on the overall results of the meta-analysis. Finally, even with 
this compromise, we ended up with only 88 effect sizes, and as a result, several of the specific 
findings from the meta-analysis were based on a sample of fewer than five effect sizes. In 
particular, the small sample size restricted the extent to which we could examine interactions 
among variables, and as a result, the influence of some variables could be confounded. 
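
The concern about a single study dominating the results can be illustrated with a small leave-one-study-out check. The sketch below is not part of the original analysis: it uses only an illustrative subset of the grammatical-accuracy effect sizes listed in Appendix C (three studies, eleven effect sizes), and the function name is a hypothetical helper introduced here.

```python
from collections import Counter
from statistics import mean

# Illustrative subset of effect sizes copied from Appendix C (outcome focus GR).
# This is NOT the full sample of 88 effect sizes analyzed in the report.
EFFECTS = [
    ("Ashwell (2000)", 1.70),
    ("Ashwell (2000)", -0.75),
    ("Ashwell (2000)", 1.28),
    ("Bitchener, Young, & Cameron (2005)", 0.46),
    ("Bitchener, Young, & Cameron (2005)", 0.09),
    ("Cardelle & Corno (1981)", 0.65),
    ("Cardelle & Corno (1981)", 1.74),
    ("Cardelle & Corno (1981)", 2.25),
    ("Cardelle & Corno (1981)", 1.08),
    ("Cardelle & Corno (1981)", 4.02),
    ("Cardelle & Corno (1981)", 0.67),
]


def leave_one_study_out(effects):
    """Recompute the unweighted mean effect size with each study removed in turn."""
    studies = sorted({study for study, _ in effects})
    return {
        dropped: mean(d for study, d in effects if study != dropped)
        for dropped in studies
    }


overall = mean(d for _, d in EFFECTS)
counts = Counter(study for study, _ in EFFECTS)
loo = leave_one_study_out(EFFECTS)

print(f"Overall mean of the subset: {overall:.2f}")
for study, value in loo.items():
    print(f"Without {study} ({counts[study]} effect sizes): {value:.2f}")
```

In this subset, one study contributes six of the eleven effect sizes, and removing it lowers the mean from about 1.20 to about 0.56, which is the kind of sensitivity the paragraph above warns about.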

In addition, meta-analysis is inherently reductive in nature, and as a result, many of the 
particular findings from individual studies are discounted. To be published, it is usually 
necessary for a study to be innovative, filling some gap in the previous literature. Thus, direct 
replications of previous research are almost never published. And as a result, none of these 
studies are exactly comparable. For the purposes of the meta-analysis, we collapsed numerous 
more specific measures into a few general categories. But in fact, studies employed many 
different specific treatments and measured many different specific outcomes. This is an 
important caveat that should be applied to any quantitative meta-analysis: Its strength is 
identifying the general trends that hold across a research domain, but its major weakness is that 
those generalizations obscure the individual patterns of variation found across studies. 

Thus, all of the general findings described in this report should be subjected to further 
research, with more tightly controlled designs. However, the results are interesting, with three 
general findings especially worthy of future research: 

1. L2-English students seem to be very receptive to feedback from sources other than 
teachers. 

2. Feedback on content is at least as important as feedback on form. Even when the 
writing development goals are to improve grammatical accuracy, feedback on form 
coupled with feedback on content is more effective than feedback focused exclusively 
on form. 

3. Feedback in the form of written comments is more effective than simple error 
location, again even when the writing development goals are to improve grammatical 
accuracy. 

Notes 

1 Random assignment of subjects to treatment or control groups was absent from nearly all 
of these studies, so its absence was not applied as a criterion for exclusion. 

2 In some large-scale meta-analyses, the estimates of each individual effect size are weighted 
according to the sample size of the study, either directly or by using inverse variance weights. 
Then, the study reports a weighted mean effect size rather than a simple arithmetic mean of 
effect sizes. (See, e.g., Hedges & Olkin, 1985; Lipsey & Wilson, 2001.) However, this 
practice has generally not been adopted in previous meta-analyses in applied linguistics (e.g., 
Norris & Ortega, 2000; Truscott, 2007), and so it was not employed in the present analysis. 
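
The difference between the two approaches described in this note can be sketched in a few lines. The example below is an illustration only, not the procedure used in this report: it compares the simple arithmetic mean with a sample-size-weighted mean for the three grammatical-accuracy effect sizes reported for Ashwell (2000) in Appendix B. Inverse variance weights are not shown because they require a variance estimate for each effect size, which the appendices do not report; the function names are hypothetical.

```python
def unweighted_mean(effects):
    """Simple arithmetic mean of effect sizes (the approach used in this report)."""
    return sum(effects) / len(effects)


def sample_size_weighted_mean(effects, sample_sizes):
    """Mean of effect sizes weighted directly by each group's sample size."""
    if len(effects) != len(sample_sizes):
        raise ValueError("effects and sample_sizes must have the same length")
    total_n = sum(sample_sizes)
    return sum(d * n for d, n in zip(effects, sample_sizes)) / total_n


# Ashwell (2000), accuracy outcome: effect sizes and group sizes from Appendix B.
effects = [1.70, -0.75, 1.28]
sample_sizes = [12, 13, 13]

simple = unweighted_mean(effects)
weighted = sample_size_weighted_mean(effects, sample_sizes)

print(f"Unweighted mean:           {simple:.2f}")
print(f"Sample-size-weighted mean: {weighted:.2f}")
```

With group sizes this similar, the two means differ only slightly (about 0.74 versus 0.72); the gap would grow if large and small samples produced very different effect sizes.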

Appendix B 

Summary of All Individual Effect Sizes Included in the Quantitative Meta-Analysis 


| Study | Design | Description of groups | Sample size (n) | Type of feedback | Feedback focus | Outcome measure | Effect size |
| Ashwell (2000) | PP | Content then form | 12; Group 1 | LO | C, F | WP: accuracy: number of errors/number of words | 1.7 |
| | PP | Form then content | 13; Group 2 | LO | F, C | WP: accuracy: number of errors/number of words | -.75 |
| | PP | F&C then F&C | 13; Group 3 | LO | F, C | WP: accuracy: number of errors/number of words | 1.28 |
| | PP | Content then form | 12; Group 1 | LO | C, F | WP: content scores (rating 1-20) | .17 |
| | PP | Form then content | 13; Group 2 | LO | F, C | WP: content scores (rating 1-20) | -.08 |
| | PP | F&C then F&C | 13; Group 3 | LO | F, C | WP: content scores (rating 1-20) | -.34 |
| Berg (1999) | TC | Trained vs. untrained, level 3 | 24 | Revision activities | Revision | RV: meaning change | 1.51 |
| | TC | Trained vs. untrained, level 4 | 22 | Revision activities | Revision | RV: meaning change | 2.25 |
| | TC | Trained vs. untrained, both levels | 46 | Revision activities | Revision | RV: meaning change | 1.90 |
| Bitchener, Young, & Cameron (2005) | TC | Oral and written vs. no feedback, time 1 | 36; Group 1 | LO + DC | F | WP: % of correct uses (prepositions) | .21 |
| | TC | Oral and written vs. no feedback, time 2 | 36; Group 1 | LO + DC | F | WP: % of correct uses (prepositions) | .03 |
| | TC | Oral and written vs. no feedback, time 3 | 36; Group 1 | LO + DC | F | WP: % of correct uses (prepositions) | -.30 |
| | TC | Oral and written vs. no feedback, time 4 | 36; Group 1 | LO + DC | F | WP: % of correct uses (prepositions) | .88 |
| | TC | Oral and written vs. no feedback, time 1 | 36; Group 1 | LO + DC | F | WP: % of correct uses (past simple) | 1.17 |
| | TC | Oral and written vs. no feedback, time 2 | 36; Group 1 | LO + DC | F | WP: % of correct uses (past simple) | .24 |
| | TC | Oral and written vs. no feedback, time 3 | 36; Group 1 | LO + DC | F | WP: % of correct uses (past simple) | -.25 |
| | TC | Oral and written vs. no feedback, time 4 | 36; Group 1 | LO + DC | F | WP: % of correct uses (past simple) | .77 |
| | TC | Oral and written vs. no feedback, time 1 | 36; Group 1 | LO + DC | F | WP: % of correct uses (definite article) | .23 |
| | TC | Oral and written vs. no feedback, time 2 | 36; Group 1 | LO + DC | F | WP: % of correct uses (definite article) | .83 |
| | TC | Oral and written vs. no feedback, time 3 | 36; Group 1 | LO + DC | F | WP: % of correct uses (definite article) | -.02 |
| | TC | Oral and written vs. no feedback, time 4 | 36; Group 1 | LO + DC | F | WP: % of correct uses (definite article) | 1.76 |
| | TC | Written only vs. control, time 1 | 34; Group 2 | LO + DC | F | WP: % of correct uses (prepositions) | .32 |
| | TC | Written only vs. control, time 2 | 34; Group 2 | LO + DC | F | WP: % of correct uses (prepositions) | .10 |
| | TC | Written only vs. control, time 3 | 34; Group 2 | LO + DC | F | WP: % of correct uses (prepositions) | .04 |
| | TC | Written only vs. control, time 4 | 34; Group 2 | LO + DC | F | WP: % of correct uses (prepositions) | -.17 |
| | TC | Written only vs. control, time 1 | 34; Group 2 | LO + DC | F | WP: % of correct uses (past simple) | .30 |
| | TC | Written only vs. control, time 2 | 34; Group 2 | LO + DC | F | WP: % of correct uses (past simple) | -.57 |
| | TC | Written only vs. control, time 3 | 34; Group 2 | LO + DC | F | WP: % of correct uses (past simple) | -.27 |
| | TC | Written only vs. control, time 4 | 34; Group 2 | LO + DC | F | WP: % of correct uses (past simple) | -.37 |
| | TC | Written only vs. control, time 1 | 34; Group 2 | LO + DC | F | WP: % of correct uses (definite article) | .47 |
| | TC | Written only vs. control, time 2 | 34; Group 2 | LO + DC | F | WP: % of correct uses (definite article) | .43 |
| | TC | Written only vs. control, time 3 | 34; Group 2 | LO + DC | F | WP: % of correct uses (definite article) | .11 |
| | TC | Written only vs. control, time 4 | 34; Group 2 | LO + DC | F | WP: % of correct uses (definite article) | .65 |
| Cardelle & Corno (1981) | PP | Written, praise, pretest vs. posttest 1, course 1 | 12; Group 1 | SE + SI | F, C | WP scores (grammar + vocabulary) | .41 |
| | PP | Written, praise, pretest vs. posttest 2, course 1 | 12; Group 1 | SE + SI | F, C | WP scores (grammar + vocabulary) | .72 |
| | PP | Written, praise, pretest vs. posttest 3, course 1 | 12; Group 1 | SE + SI | F, C | WP scores (grammar + vocabulary) | .82 |
| | PP | Written, praise, pretest vs. posttest 1, course 2 | 7; Group 2 | SE + SI | F, C | WP scores (grammar + vocabulary) | 1.75 |
| | PP | Written, praise, pretest vs. posttest 2, course 2 | 7; Group 2 | SE + SI | F, C | WP scores (grammar + vocabulary) | 1.81 |
| | PP | Written, praise, pretest vs. posttest 3, course 2 | 7; Group 2 | SE + SI | F, C | WP scores (grammar + vocabulary) | 1.66 |
| | PP | Written, criticism, pretest vs. posttest 1, course 1 | 8; Group 3 | SE + ML | F, C | WP scores (grammar + vocabulary) | 2.09 |
| | PP | Written, criticism, pretest vs. posttest 2, course 1 | 8; Group 3 | SE + ML | F, C | WP scores (grammar + vocabulary) | 2.34 |
| | PP | Written, criticism, pretest vs. posttest 3, course 1 | 8; Group 3 | SE + ML | F, C | WP scores (grammar + vocabulary) | 2.34 |
| | PP | Written, criticism, pretest vs. posttest 1, course 2 | 6; Group 4 | SE + ML | F, C | WP scores (grammar + vocabulary) | .65 |
| | PP | Written, criticism, pretest vs. posttest 2, course 2 | 6; Group 4 | SE + ML | F, C | WP scores (grammar + vocabulary) | 1.92 |
| | PP | Written, criticism, pretest vs. posttest 3, course 2 | 6; Group 4 | SE + ML | F, C | WP scores (grammar + vocabulary) | .67 |
| | PP | Combination (praise + criticism), pretest vs. posttest 1, course 1 | 12; Group 5 | SE, SI, ML | F, C | WP scores (grammar + vocabulary) | 3.41 |
| | PP | Combination (praise + criticism), pretest vs. posttest 2, course 1 | 12; Group 5 | SE, SI, ML | F, C | WP scores (grammar + vocabulary) | 4.22 |
| | PP | Combination (praise + criticism), pretest vs. posttest 3, course 1 | 12; Group 5 | SE, SI, ML | F, C | WP scores (grammar + vocabulary) | 4.44 |
| | PP | Combination (praise + criticism), pretest vs. posttest 1, course 2 | 7; Group 6 | SE, SI, ML | F, C | WP scores (grammar + vocabulary) | .62 |
| | PP | Combination (praise + criticism), pretest vs. posttest 2, course 2 | 7; Group 6 | SE, SI, ML | F, C | WP scores (grammar + vocabulary) | .75 |
| | PP | Combination (praise + criticism), pretest vs. posttest 3, course 2 | 7; Group 6 | SE, SI, ML | F, C | WP scores (grammar + vocabulary) | .64 |
| | TC | Praise vs. no feedback, avg. posttest, course 1 | 22 | SE, SI | F, C | WP scores (grammar + vocabulary) | .76 |
| | TC | Praise vs. no feedback, avg. posttest, course 2 | 13 | SE, SI | F, C | WP scores (grammar + vocabulary) | 1.07 |
| | TC | Criticism vs. no feedback, avg. posttest, course 1 | 18 | SE, ML | F, C | WP scores (grammar + vocabulary) | 1.13 |
| | TC | Criticism vs. no feedback, avg. posttest, course 2 | 12 | SE, ML | F, C | WP scores (grammar + vocabulary) | .88 |
| | TC | Combination vs. no feedback, avg. posttest, course 1 | 22 | SE, SI, ML | F, C | WP scores (grammar + vocabulary) | 2.21 |
| | TC | Combination vs. no feedback, avg. posttest, course 2 | 13 | SE, SI, ML | F, C | WP scores (grammar + vocabulary) | 1.07 |
| Chandler (2003), Study 1 | PP | Direct correction, ch. 1-5 | 15; Group 1 | DC, CM | F, C | Accuracy: number of errors per 100 words | 1.04 |
| | PP | Direct correction, ch. 1-5 | 14; Group 2 | DC, CM | F, C | Fluency: minutes per 100 words | .82 |
| | TC | Direct correction vs. control, ch. 1 | 31; Group 3 | DC, CM | F, C | Accuracy: number of errors per 100 words (did not decrease) | -.49 |
| | TC | Direct correction vs. control, ch. 5 | 31; Group 3 | DC, CM | F, C | Accuracy: number of errors per 100 words | .52 |
| | TC | Direct correction vs. control, ch. 1 | 26; Group 4 | DC, CM | F, C | Fluency: minutes per 100 words | .02 |
| | TC | Direct correction vs. control, ch. 5 | 26; Group 4 | DC, CM | F, C | Fluency: minutes per 100 words (did not increase) | -.04 |
| Chandler (2003), Study 2 | PP | Revision group, ch. 1-5 | 65 | DC, CM, LO | F, C | Accuracy: number of errors per 100 words | .43 |
| | PP | Revision group, ch. 1-5 | 30 | DC, CM, LO | F, C | Fluency: mean time to write 100 words | 1.77 |
| | TC | Correction vs. description (treated as control) | 61 | DC | F, C | Accuracy: number of errors per 100 words | 1.02 |
| | TC | Underlining and description vs. description | 54 | LO + CM | F, C | Accuracy: number of errors per 100 words | .42 |
| | TC | Underlining vs. description | 57 | LO | F, C | Accuracy: number of errors per 100 words | .05 |
| Davis & Fulton (1997) | PP | Feedback during composing process | 20 | SE | F, C | Writing quality: rating scales | 1.86 |
| | PP | Feedback after composing process | 20 | SE | F, C | Writing quality: rating scales | 2.47 |
| Davis & Kelley (1999) | PP | Feedback during composing process | 45 | SE | F, C | Writing quality: rating scales | 1.80 |
| | PP | Feedback after composing process | 43 | SE | F, C | Writing quality: rating scales | 1.54 |
| Fathman & Whalley (1990) | PP | Grammar feedback | 14 | LO | F | Writing quality: grammar score (number of errors) | 2.08 |
| | PP | Grammar feedback | 14 | LO | F | Writing quality: content rating | 1.04 |
| | PP | Content feedback | 22 | CM | C | Writing quality: grammar score (number of errors) | .09 |
| | PP | Content feedback | 22 | CM | C | Writing quality: content rating | 1.22 |
| | PP | Grammar + content | 22 | LO, CM | F, C | Writing quality: grammar score (number of errors) | 2.15 |
| | PP | Grammar + content | 22 | LO, CM | F, C | Writing quality: content rating | 1.02 |
| Ferris (2006) | PP | Grammar feedback, essay 1-4 | 55 | EC | F | Writing quality: total errors | .36 |
| | PP | Grammar feedback, essay 1-4 | 55 | EC | F | Writing quality: verb errors | .50 |
| | PP | Grammar feedback, essay 1-4 | 55 | EC | F | Writing quality: noun errors | .20 |
| | PP | Grammar feedback, essay 1-4 | 55 | EC | F | Writing quality: article errors (did not identify more) | -.10 |
| | PP | Grammar feedback, essay 1-4 | 55 | EC | F | Writing quality: lexical errors | .26 |
| | PP | Grammar feedback, essay 1-4 | 55 | EC | F | Writing quality: sentence errors | -.01 |
| Ferris & Roberts (2001) | TC | Codes vs. no feedback, verbs | 42 | LO + EC | F | Writing quality: errors marked | .81 |
| | TC | Codes vs. no feedback, nouns | 42 | LO + EC | F | Writing quality: errors marked | .69 |
| | TC | Codes vs. no feedback, articles | 42 | LO + EC | F | Writing quality: errors marked | .06 |
| | TC | Codes vs. no feedback, word choice | 42 | LO + EC | F | Writing quality: errors marked (did not identify more) | -.47 |
| | TC | Codes vs. no feedback, sentence structure | 42 | LO + EC | F | Writing quality: errors marked | .10 |
| | TC | Codes vs. no feedback, total | 42 | LO + EC | F | Writing quality: errors marked | .64 |
| | TC | No codes vs. no feedback, verbs | 39 | LO | F | Writing quality: errors marked | .67 |
| | TC | No codes vs. no feedback, nouns | 39 | LO | F | Writing quality: errors marked | .77 |
| | TC | No codes vs. no feedback, articles | 39 | LO | F | Writing quality: errors marked | .16 |
| | TC | No codes vs. no feedback, word choice | 39 | LO | F | Writing quality: errors marked | -.49 |
| | TC | No codes vs. no feedback, sentence structure | 39 | LO | F | Writing quality: errors marked | .09 |
| | TC | No codes vs. no feedback, total | 39 | LO | F | Writing quality: errors marked | .46 |
| Hillocks (1982) | PP | Prewriting + revision | 75 | CM | F, C | Writing quality: rating scale | .61 |
| | PP | Prewriting, no revision | 72 | - | - | Writing quality: rating scale | .82 |
| | PP | Assignment + revision | 67 | CM | F, C | Writing quality: rating scale | .81 |
| Kamimura (2006) | PP | Oral peer feedback; high proficiency group | 12 | SI | | Writing quality: rating scale | 1.96 |
| | PP | Oral peer feedback; low proficiency group | 12 | SI | | Writing quality: rating scale | 2.33 |
| Lalande (1982) | TC | Error-correction group vs. control | 60 | EC | F | Writing quality: number of errors | .46 |
| | PP | Error-correction group | 30 | EC | F | Writing quality: number of non-lexical errors | .29 |
| Lee (1997) | TC | Marked vs. unmarked (control) | 99 | LO | F, C | Error correction score | 2.28 |
| | TC | Slightly marked vs. unmarked | 99 | EX | F, C | Error correction score | 2.24 |
| Matsumura, Patthey-Chavez, | PP | Revision, lower achieving group | 43 | DC, CM | F | Writing quality: rating scale, content | .10 |
| Valdes, & Garnier (2002) | PP | Revision, lower achieving group | 43 | DC, CM | F | Writing quality: rating scale, organization | .32 |
| | PP | Revision, lower achieving group | 43 | DC, CM | F | Writing quality: rating scale, writing conventions | .85 |
| | PP | Revision, higher achieving group | 44 | DC, CM | F | Writing quality: rating scale, content | .10 |
| | PP | Revision, higher achieving group | 44 | DC, CM | F | Writing quality: rating scale, organization | .10 |
| | PP | Revision, higher achieving group | 44 | DC, CM | F | Writing quality: rating scale, writing conventions | .31 |
| McCutchen, Francis, & Kerr (1997), study 1 | TC | Uncued vs. cued, familiar text, 7th graders | 23 | EX | F, C | Number of errors corrected, spelling (uncued did not correct more) | -.75 |
| | TC | Uncued vs. cued, familiar text, 7th graders | 23 | EX | F, C | Number of errors corrected, meaning | .87 |
| | TC | Uncued vs. cued, unfamiliar text, 7th graders | 23 | EX | F, C | Number of errors corrected, spelling | -.74 |
| | TC | Uncued vs. cued, unfamiliar text, 7th graders | 23 | EX | F, C | Number of errors corrected, meaning | .08 |
| | TC | Uncued vs. cued, familiar text, college students | 14 | EX | F, C | Number of errors corrected, spelling | -.37 |
| | TC | Uncued vs. cued, familiar text, college students | 14 | EX | F, C | Number of errors corrected, meaning | -.37 |
| | TC | Uncued vs. cued, unfamiliar text, college students | 14 | EX | F, C | Number of errors corrected, spelling | -.17 |
| | TC | Uncued vs. cued, unfamiliar text, college students | 14 | EX | F, C | Number of errors corrected, meaning | -.55 |
| McCutchen, Francis, & Kerr (1997), study 2 | TC | Uncued vs. cued, familiar text, high ability | 8 | EX | F, C | Number of errors corrected, spelling | NA |
| | TC | Uncued vs. cued, familiar text, high ability | 8 | EX | F, C | Number of errors corrected, meaning | .99 |
| | TC | Uncued vs. cued, unfamiliar text, high ability | 8 | EX | F, C | Number of errors corrected, spelling | -.99 |
| | TC | Uncued vs. cued, unfamiliar text, high ability | 8 | EX | F, C | Number of errors corrected, meaning | NA |
| | TC | Uncued vs. cued, familiar text, middle ability | 8 | EX | F, C | Number of errors corrected, spelling | -.45 |
| | TC | Uncued vs. cued, familiar text, middle ability | 8 | EX | F, C | Number of errors corrected, meaning | -.20 |
| | TC | Uncued vs. cued, unfamiliar text, middle ability | 8 | EX | F, C | Number of errors corrected, spelling | NA |
| | TC | Uncued vs. cued, unfamiliar text, middle ability | 8 | EX | F, C | Number of errors corrected, meaning | -.50 |
| | TC | Uncued vs. cued, familiar text, low ability | 12 | EX | F, C | Number of errors corrected, spelling | 0 |
| | TC | Uncued vs. cued, familiar text, low ability | 12 | EX | F, C | Number of errors corrected, meaning | .37 |
| | TC | Uncued vs. cued, unfamiliar text, low ability | 12 | EX | F, C | Number of errors corrected, spelling | -.41 |
| | TC | Uncued vs. cued, unfamiliar text, low ability | 12 | EX | F, C | Number of errors corrected, meaning | .82 |
| McGroarty & Zhu (1997) | TC | Training in peer revision vs. no training | 89 | Revision | F, C | Holistic scores, round 1 | .08 |
| | TC | Training in peer revision vs. no training | 89 | Revision | F, C | Holistic scores, round 2 | .06 |
| | TC | Training in peer revision vs. no training | 89 | Revision | F, C | Holistic scores, portfolio grades | .25 |
| Brakel Olson (1990) | PP | Revision + peer practice | 23 | Revision strategies | F | Writing quality: rating scale, total | .33 |
| | PP | Revision | 24 | Revision strategies | F | Writing quality: rating scale, total | .14 |
| | PP | Revision + peer practice | 23 | Revision strategies | F | Writing quality: rating scale, rhetorical quality | .23 |
| | PP | Revision | 24 | Revision strategies | F | Writing quality: rating scale, rhetorical quality (did not improve) | -.03 |
| | PP | Revision + peer practice | 23 | Revision strategies | F | Writing quality: rating scale, surface structure | .37 |
| | PP | Revision | 24 | Revision strategies | F | Writing quality: rating scale, surface structure | .47 |
| Olson and Raffeld (1987) | TC | Comments on content vs. no comments | 16 | CM | C | Writing quality: holistic | 1.12 |
| | TC | Comments on content vs. no comments | 16 | CM | C | Writing quality: content | .10 |
| | TC | Comments on form vs. no comments | 14 | CM | F | Writing quality: holistic | .02 |
| | TC | Comments on form vs. no comments | 14 | CM | F | Writing quality: content | -.90 |
| Polio, Fleck, & Leder (1998) | PP | Error correction group, 30 minute | 34 | DC | F | Linguistic accuracy: error-free T-units (EFTs) | .35 |
| | PP | Error correction group, 30 minute | 34 | DC | F | Linguistic accuracy: number of words in EFTs per total words | .32 |
| | PP | Error correction group, 60 minute | 34 | DC | F | Linguistic accuracy: error-free T-units (EFTs) | .50 |
| | PP | Error correction group, 60 minute | 34 | DC | F | Linguistic accuracy: number of words in EFTs per total words | .46 |
| | TC | Error correction vs. control, 30 minute posttest | 65 | DC | F | Linguistic accuracy: error-free T-units (EFTs) | .10 |
| | TC | Error correction vs. control, 30 minute posttest | 65 | DC | F | Linguistic accuracy: number of words in EFTs per total words (did not perform better) | -.10 |
| | TC | Error correction vs. control, 60 minute posttest | 65 | DC | F | Linguistic accuracy: error-free T-units (EFTs) (did not perform better) | -.12 |
| | TC | Error correction vs. control, 60 minute posttest | 65 | DC | F | Linguistic accuracy: number of words in EFTs per total words (did not perform better) | -.05 |
| | PP | Revision group 1: error feedback + revision | 35 | DC, CM, revision | Not given | Holistic grading, no specific focus | 1.19 |
| | PP | Revision group 2: error feedback + revision | 35 | DC, CM, revision | Not given | Holistic grading, no specific focus | 1.07 |
| | PP | Traditional group: error feedback | 30 | DC, LO | Not given | Holistic grading, no specific focus | .37 |
| Sengupta (2000) | TC | Revision group 1 vs. no feedback | 35 | Revision instruction | | Holistic grading | .68 |
| | TC | Revision group 2 vs. no feedback | 35 | Revision instruction | | Holistic grading | .93 |
| Xiang (2004) | TC | Experimental group (annotation) vs. control | 58 | SI, SE | F, C | Holistic scores: total | .30 |
| | TC | Experimental group (annotation) vs. control | 58 | SI, SE | F, C | Content score | .30 |
| | TC | Experimental group (annotation) vs. control | 58 | SI, SE | F, C | Holistic scores: organization | 1.10 |
| | TC | Experimental group (annotation) vs. control | 58 | SI, SE | F, C | Grammar score | .16 |
| | TC | Experimental group (annotation) vs. control | 58 | SI, SE | F, C | Vocabulary score (did not perform better) | -.11 |
| | TC | Experimental group (annotation) vs. control | 58 | SI, SE | F, C | Mechanics score | .13 |
| Zimmerman & Kitsantas (2002) | TC | Social feedback vs. no social feedback, no model, writing skill posttest | 24 | SE | Not specified | Writing quality: rating scale | .73 |
| | TC | Social feedback vs. no social feedback, mastery model, writing skill posttest | 24 | SE | Not specified | Writing quality: rating scale | .38 |
| | TC | Social feedback vs. no social feedback, coping model, writing skill posttest | 24 | SE | Not specified | Writing quality: rating scale | .49 |
| | TC | Social feedback vs. no social feedback, no model, self-efficacy posttest | 24 | SE | Not specified | Ability self-rating: rating scale | .61 |
| | TC | Social feedback vs. no social feedback, mastery model, self-efficacy posttest | 24 | SE | Not specified | Ability self-rating: rating scale | -.02 |
| | TC | Social feedback vs. no social feedback, coping model, self-efficacy posttest | 24 | SE | Not specified | Ability self-rating: rating scale | .98 |
| | TC | Social feedback vs. no social feedback, no model, self-satisfaction posttest | 24 | SE | Not specified | Attitude self-rating: rating scale | .37 |
| | TC | Social feedback vs. no social feedback, mastery model, self-satisfaction posttest | 24 | SE | Not specified | Attitude self-rating: rating scale | .06 |
| | TC | Social feedback vs. no social feedback, coping model, self-satisfaction posttest | 24 | SE | Not specified | Attitude self-rating: rating scale | 1.16 |

Note. See Table 1 for a list of all variables and a key to the variable codes. 
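
Several studies in this appendix contribute repeated measurements for the same comparison. As an illustration, the sketch below averages the twelve Bitchener, Young, and Cameron (2005) effect sizes reported above for each comparison group (three grammatical structures at four time points). It assumes a simple arithmetic mean across all twelve values; that assumption is ours, although the results agree with the two values listed for this study in Appendix C (0.46 and 0.09).

```python
from statistics import mean

# Effect sizes copied from Appendix B for Bitchener, Young, & Cameron (2005).
# Each list holds times 1-4 for prepositions, past simple, and definite article.
BITCHENER_EFFECTS = {
    "oral and written vs. no feedback": [
        0.21, 0.03, -0.30, 0.88,    # prepositions
        1.17, 0.24, -0.25, 0.77,    # past simple
        0.23, 0.83, -0.02, 1.76,    # definite article
    ],
    "written only vs. control": [
        0.32, 0.10, 0.04, -0.17,    # prepositions
        0.30, -0.57, -0.27, -0.37,  # past simple
        0.47, 0.43, 0.11, 0.65,     # definite article
    ],
}

# Assumption: one average per comparison group, rounded to two decimals.
averaged = {
    group: round(mean(values), 2)
    for group, values in BITCHENER_EFFECTS.items()
}

for group, value in averaged.items():
    print(f"{group}: {value:.2f}")
```

Collapsing repeated measurements in this way keeps a single study from contributing a dozen near-duplicate data points to one comparison, at the cost of hiding the variation across time points and structures.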



Appendix C 

Summary of the Final Effect Sizes Included in the Quantitative Meta-Analysis 


| Study | Design type | Language group | Level | Source of feedback | Mode of feedback | Type of feedback | Focus of feedback | Outcome type | Outcome focus | Effect size (or average effect size) |
|---|---|---|---|---|---|---|---|---|---|---|
| Ashwell (2000) | PP | E2 | | TE | WR | LO | C/F | WP | GR | 1.7 |
| | PP | E2 | | TE | WR | LO | C/F | WP | GR | -0.75 |
| | PP | E2 | | TE | WR | LO | C/F | WP | GR | 1.28 |
| | PP | E2 | | TE | WR | LO | C/F | WP | C | 0.17 |
| | PP | E2 | | TE | WR | LO | C/F | WP | C | -0.08 |
| | PP | E2 | | TE | WR | LO | C/F | WP | C | -0.34 |
| Berg (1999) | TC | E2 | H | O | WR | O | OTH | RV | C | 1.51 |
| | TC | E2 | H | O | WR | O | OTH | RV | C | 2.25 |
| | TC | E2 | H | O | WR | O | OTH | RV | C | 1.9 |
| Bitchener, Young, & Cameron (2005) | TC | E2 | H | TE | WR | M | F | WP | GR | 0.46ᵃ |
| | TC | E2 | H | TE | WR | M | F | WP | GR | 0.09ᵃ |
| Cardelle & Corno (1981) | PP | O2 | L | TE | OR | CM | C/F | WP | GR | 0.65ᵃ |
| | PP | O2 | L | TE | OR | CM | C/F | WP | GR | 1.74ᵃ |
| | PP | O2 | L | TE | OR | CM | C/F | WP | GR | 2.25ᵃ |
| | PP | O2 | L | TE | OR | CM | C/F | WP | GR | 1.08ᵃ |
| | PP | O2 | L | TE | OR | CM | C/F | WP | GR | 4.02ᵃ |
| | PP | O2 | L | TE | OR | CM | C/F | WP | GR | 0.67ᵃ |
| | TC | O2 | L | TE | OR | CM | C/F | WP | GR | 0.76 |
| | TC | O2 | L | TE | OR | CM | C/F | WP | GR | 1.07 |
| | TC | O2 | L | TE | OR | CM | C/F | WP | GR | 1.18 |
| | TC | O2 | L | TE | OR | CM | C/F | WP | GR | 0.88 |
| | TC | O2 | L | TE | OR | CM | C/F | WP | GR | 2.21 |
| | TC | O2 | L | TE | OR | CM | C/F | WP | GR | 1.07 |
| Chandler (2003), study 1 | PP | E2 | H | TE | WR | M | C/F | WP | GR | 1.04 |
| | PP | E2 | H | TE | WR | M | C/F | WP | H | 0.82 |
| | TC | E2 | H | TE | WR | M | C/F | WP | GR | 0.02ᵃ |
| | TC | E2 | H | TE | WR | M | C/F | WP | H | -o.or |
| Chandler (2003), study 2 | PP | E2 | H | TE | WR | M | C/F | WP | GR | 0.43 |
| | PP | E2 | H | TE | WR | M | C/F | WP | H | 1.77 |
| | TC | E2 | H | TE | WR | DC | C/F | WP | GR | 1.02 |
| | TC | E2 | H | TE | WR | M | C/F | WP | GR | 0.42 |
| | TC | E2 | H | TE | WR | LO | C/F | WP | GR | 0.05 |
| Davis & Fulton (1997) | PP | E1 | H | TE | OR | CM | C/F | WP | H | 1.86 |
| | PP | E1 | H | TE | OR | CM | C/F | WP | H | 2.47 |
| Davis & Mahoney (1999) | PP | E1 | H | TE | OR | CM | C/F | WP | H | 1.8 |
| | PP | E1 | H | TE | OR | CM | C/F | WP | H | 1.54 |
| Fathman & Whalley (1990) | PP | E2 | L | TE | WR | LO | F | WP | GR | 2.08 |
| | PP | E2 | L | TE | WR | LO | F | WP | C | 1.04 |
| | PP | E2 | L | TE | WR | CM | C | WP | GR | 0.09 |
| | PP | E2 | L | TE | WR | CM | C | WP | C | 1.22 |
| | PP | E2 | L | TE | WR | M | C/F | WP | GR | 2.15 |
| | PP | E2 | L | TE | WR | M | C/F | WP | C | 1.02 |
| Ferris (2006) | PP | E2 | | TE | WR | LO | F | WP | GR | 0.2ᵃ |
| Ferris & Roberts (2001) | TC | E2 | | TE | WR | M | F | WP | GR | 0.29ᵃ |
| | TC | E2 | | TE | WR | LO | F | WP | GR | 0.28ᵃ |
| Hillocks (1982) | PP | E1 | | TE | WR | CM | C/F | WP | H | 0.61 |
| | PP | E1 | | TE | WR | CM | C/F | WP | H | 0.81 |
| Kamimura (2006) | PP | E2 | H | O | OR | CM | | WP | H | 1.96 |
| | PP | E2 | L | O | OR | CM | | WP | H | 2.33 |
| Lalande (1982) | TC | O2 | H | TE | WR | LO | F | WP | H | 0.46 |
| | PP | O2 | H | TE | WR | LO | F | WP | H | 0.29 |
| Lee (1997) | TC | E2 | L | O | WR | LO | C/F | WP | GR | 2.28 |
| | TC | E2 | L | O | WR | M | C/F | WP | GR | 2.24 |
| Matsumura, Patthey-Chavez, Valdes, & Garnier (2002) | PP | MX | L | TE | WR | M | F | WP | C | 0.1 |
| | PP | MX | L | TE | WR | M | F | WP | H | 0.32 |
| | PP | MX | L | TE | WR | M | F | WP | GR | 0.85 |
| | PP | MX | H | TE | WR | M | F | WP | C | 0.1 |
| | PP | MX | H | TE | WR | M | F | WP | H | 0.1 |
| | PP | MX | H | TE | WR | M | F | WP | GR | 0.31 |
| McCutchen, Francis, & Kerr (1997), study 1 | TC | E1 | L | O | WR | CM | C/F | WP | SP | -0.75ᵃ |
| | TC | E1 | L | O | WR | CM | C/F | WP | C | 0.48ᵃ |
| | TC | E1 | H | O | WR | CM | C/F | WP | SP | -0.27ᵃ |
| | TC | E1 | H | O | WR | CM | C/F | WP | C | -0.46ᵃ |
| McCutchen, Francis, & Kerr (1997), study 2 | TC | E1 | H | O | WR | CM | C/F | WP | C | 0.99ᵃ |
| | TC | E1 | H | O | WR | CM | C/F | WP | SP | -0.99ᵃ |
| | TC | E1 | M | O | WR | CM | C/F | WP | SP | -0.45ᵃ |
| | TC | E1 | M | O | WR | CM | C/F | WP | C | -0.38ᵃ |
| | TC | E1 | L | O | WR | CM | C/F | WP | SP | -0.2ᵃ |
| | TC | E1 | L | O | WR | CM | C/F | WP | C | 0.59ᵃ |
| McGroarty & Zhu (1997) | TC | MX | | O | | O | C/F | WP | H | 0.13ᵃ |
| Brakel Olson (1990) | PP | E1 | | O | | O | F | WP | H | 0.31ᵃ |
| | PP | E1 | | O | | O | F | WP | H | 0.19ᵃ |
| Olson & Raffeld (1987) | TC | E2 | | TE | WR | CM | C | WP | H | 1.12 |
| | TC | E2 | | TE | WR | CM | C | WP | C | 0.1 |
| | TC | E2 | | TE | WR | CM | F | WP | H | 0.02 |
| | TC | E2 | | TE | WR | CM | F | WP | C | -0.9 |
| Polio, Fleck, & Leder (1998) | PP | E2 | H | TE | WR | M | F | WP | GR | 0.41ᵃ |
| | TC | E2 | H | TE | WR | M | F | WP | GR | -0.04ᵃ |
| | PP | E2 | H | TE | WR | M | F | WP | H | 1.19 |
| | PP | E2 | H | TE | WR | M | F | WP | H | 1.07 |
| | PP | E2 | H | TE | WR | M | F | WP | H | 0.37 |
| Sengupta (2000) | TC | E2 | | TE | OR | O | | WP | H | 0.68 |
| | TC | E2 | | TE | OR | O | | WP | H | 0.93 |
| Xiang (2004) | TC | E2 | | O | OR | CM | C/F | WP | H | 0.7ᵃ |
| | TC | E2 | | O | OR | CM | C/F | WP | C | 0.3ᵃ |
| | TC | E2 | | O | OR | CM | C/F | WP | GR | 0.06ᵃ |
| Zimmerman & Kitsantas (2002) | TC | E1 | | TE | OR | CM | | WP | H | 0.53ᵃ |
| | TC | E1 | | TE | OR | CM | | O | | 0.53ᵃ |

Note. See Table 1 for a list of all variables and a key to the variable codes.

ᵃ Averaged effect sizes.
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